NON-ASYMPTOTIC ERROR BOUNDS FOR SEQUENTIAL MCMC 
METHODS IN MULTIMODAL SETTINGS 



NIKOLAUS SCHWEIZER 

Abstract. We prove non-asymptotic error bounds for Sequential MCMC methods in the 
case of multimodal target distributions. Our bounds depend in an explicit way on upper 
bounds on relative densities, on constants associated with local mixing properties of the 
MCMC dynamics, namely, local spectral gaps and local hyperboundedness, and on the 
amount of probability mass shifted between effectively disconnected components of the state 
space. 



1. Introduction 

Sequential MCMC methods, see [11] and the references therein, are a class of stochastic methods for numerical integration with respect to target probability measures $\mu$ which cannot feasibly be attacked directly with standard MCMC methods due to the presence of multiple well-separated modes. The basic idea is to approximate the target distribution $\mu$ with a sequence of distributions $\mu_0, \ldots, \mu_n$ such that $\mu_n = \mu$ is the actual target distribution and such that $\mu_0$ is easy to sample from. The distributions $(\mu_k)_k$ interpolate between $\mu_0$ and $\mu_n$ in a suitable way and, roughly, the algorithm tries to carry the good sampling properties of $\mu_0$ over to $\mu_n$.

The algorithm constructs a system of $N$ particles which sequentially approximates the measures $\mu_0$ to $\mu_n$. It is initialized with $N$ independent samples from $\mu_0$ and then alternates two types of steps, Importance Sampling Resampling and MCMC: In the Importance Sampling Resampling steps, a cloud of particles approximating $\mu_k$ is transformed into a cloud of particles approximating $\mu_{k+1}$ by randomly duplicating and eliminating particles in a suitable way depending on the relative density between $\mu_{k+1}$ and $\mu_k$. In the MCMC steps, particles move independently according to an MCMC dynamics for the current target distribution in order to adjust better to the changed environment. We focus on a simple Sequential MCMC method with Multinomial Resampling which is a basic instance of the class of algorithms introduced in Del Moral, Doucet and Jasra [11].

The algorithm is essentially the same as the particle filter of Gordon, Salmond and Smith [16], yet the application - sampling from a fixed target distribution instead of filtering with an exogenous sequence of distributions - is different. The most common way of choosing the sequence $\mu_k$ consists in setting $\mu_k(dx) \propto \exp(-\beta_k H(x))\,\pi(dx)$ for an increasing sequence $\beta_k$ of (artificial) inverse temperature parameters, see, e.g., Neal [24], a reference measure $\pi$ and a Hamiltonian $H$ chosen such that $\mu_n$ is identical to the desired target distribution. Smaller values of $\beta$ suppress differences in $H$ and thus lead to flatter distributions which are easier to sample using MCMC.
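The tempering construction can be illustrated on a finite state space. The following is a minimal sketch, not part of the original paper: the function names `tempered` and `relative_density`, the energy values `H` and the uniform reference measure `pi` are all illustrative assumptions. It uses the standard fact that, for this choice of $\mu_k$, an unnormalized relative density between consecutive levels is $\exp(-(\beta_k - \beta_{k-1}) H(x))$.

```python
import math

def tempered(H, pi, beta):
    """Return mu(x) proportional to exp(-beta * H(x)) * pi(x) on a finite set."""
    weights = [p * math.exp(-beta * h) for h, p in zip(H, pi)]
    total = sum(weights)
    return [w / total for w in weights]

def relative_density(H, beta_prev, beta_next):
    """Unnormalized density of the level beta_next with respect to beta_prev."""
    return [math.exp(-(beta_next - beta_prev) * h) for h in H]

# Hypothetical example: two low-energy states separated by a high-energy one.
H = [0.0, 5.0, 0.5]
pi = [1.0 / 3.0] * 3
betas = [0.0, 0.5, 1.0, 2.0]
mus = [tempered(H, pi, b) for b in betas]

# Reweighting level k-1 by the relative density reproduces level k.
g = relative_density(H, betas[1], betas[2])
reweighted = [m * w for m, w in zip(mus[1], g)]
reweighted = [r / sum(reweighted) for r in reweighted]
```

With `beta = 0` the distribution coincides with the reference measure, which reflects the remark that small inverse temperatures give flat distributions.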

Our main question is the following: How well does the mean $\eta_n^N(f)$ of an integrand $f$ with respect to the empirical measure $\eta_n^N$ of the particle system approximate the integral of interest $\mu_n(f)$? We address this question by proving non-asymptotic error bounds of the type

$$\mathbb{E}\bigl[|\eta_n^N(f) - \mu_n(f)|^2\bigr] \le \frac{C_n(f)}{N},$$

where $\mathbb{E}$ is the expectation with respect to the randomness in the particle system and $C_n(f)$ is a constant depending on the model parameters and on the function $f$ in an explicit way. More specifically, our results address the following question: Under which conditions does the particle dynamics work well in multimodal settings where conventional MCMC methods are trapped in local modes? We prove non-asymptotic error bounds which depend on a) an upper bound on relative densities, b) constants associated with local mixing properties of the MCMC dynamics, and c) the amount of probability mass shifted between effectively disconnected components of the state space as we move from $\mu_0$ to $\mu_n$.

The results of this paper fall into two groups: We first consider a simplified model of a multimodal state space and derive error bounds which make it easy to obtain some intuition about the algorithm's ability to cope with multimodality and to study questions of algorithmic design in a relatively non-technical way. These are the results about Sequential MCMC on trees in Section 3. In Section 4 we move on to a more standard Sequential MCMC framework and give similar error bounds which also take into account local changes in the sequence of distributions and local mixing properties.

The motivation for studying a Sequential MCMC algorithm on trees stems from the fact 
that in typical applications the state space splits into more and more effectively disconnected 
modes over time as we move from $\mu_0$ to $\mu_n$. We project each such disconnected mode to
a node in a tree and consider a Sequential MCMC dynamics which moves down the tree 
from the root, one level in the tree at each step. Thus at each time k the (projected) state 
space consists of a number of nodes. Each node at level k has at least one successor at level 
k + 1. Each successor stands for one disjoint component of the original state space which 
can only be reached from its predecessor component at level k. The role of the "MCMC" 
transitions is limited to allocating particles from the nodes at level k to their successors at 
level k + 1. Particles cannot move between nodes at the same level. The latter assumption 
is in accordance with the fact that transitions between effectively disconnected components 
of the original unprojected state occur very rarely for the local MCMC dynamics applied in 
practice. We do not use any mixing properties of the MCMC dynamics since such properties 
can only be expected to have an effect within each disconnected component - they will not 
help to correct errors made in allocating particles to modes. 

We show that in this reduced setting the algorithm's approximation error can be controlled in terms of a constant which captures how strongly the components gain probability mass over time. Roughly speaking, the algorithm works well if for all $j < k$ no disconnected component under $\mu_j$ carries much less weight than its successors under $\mu_k$. The intuition for this is straightforward: If a node $x$ at time $j$ is much less important under $\mu_j$ than its successors at level $k$, there is a substantial probability that there are no particles in $x$. If $\mu_j(x)$ is small, we may then still have a reasonable particle approximation of $\mu_j$. But if we miss $x$ we also miss its successors at level $k$ and if these are important we obtain a bad approximation of $\mu_k$. Transition states with small weight create a bottleneck for the particle dynamics. These observations follow from a generic non-asymptotic bound for the quadratic error of Sequential MCMC presented in Theorem 2.3 and from Proposition 3.2 which shows how to apply this bound to the tree model.



To demonstrate how these results may help to address questions of algorithmic design in a
relatively non-technical way, we turn in Section 3.4 to a comparison of resampling particles as
opposed to weighting them as is done in the Sequential Importance Sampling method [24], [3].
We provide a detailed analysis of an example where Sequential MCMC with resampling 
works well while the error of Sequential Importance Sampling increases exponentially over 
time. This shows that resampling with a finite number of particles can indeed overcome 
difficulties associated with multimodality in settings where Sequential Importance Sampling 
fails. 

Section 4 contains our second group of results. We consider a more standard Sequential MCMC setting with a sequence of mutually absolutely continuous measures $(\mu_k)_k$ on a common state space $E$. Instead of the tree structure we consider a sequence of increasingly finer partitions of $E$ and assume that the MCMC dynamics does not move between partition elements. We show that in addition to conditions on the importance gains of disconnected components similar to those in the setting of trees, two additional conditions are sufficient for a good performance of the algorithm: Uniform upper bounds on relative densities between the $\mu_k$ ensure that the importance sampling resampling step works well. Sufficiently good mixing within modes is needed to decorrelate particles after resampling and to explore the state space locally. The mixing conditions we consider follow, e.g., from local hyperboundedness and local Poincaré inequalities for the MCMC steps. These results are developed in three steps: Theorem 2.2 recalls a generic non-asymptotic error bound for Sequential MCMC from [26]. Proposition 4.1 shows how the constants in the bound of Theorem 2.2 can be controlled in terms of local $L_{2p}$-$L_p$-stability conditions for the Feynman-Kac propagator associated with the particle dynamics. Finally, Proposition 4.11 shows how the latter stability conditions follow from the more elementary mixing and boundedness conditions mentioned above.



There is by now a substantial literature on error bounds for Sequential MCMC and related 
particle systems beginning with the central limit theorems in Del Moral [9], Chopin 0, 
Künsch [19] and Cappé, Moulines and Rydén [3]. See Del Moral [10] for an overview and
many results, and Douc and Moulines [13] for a recent contribution. Non-asymptotic error
bounds, i.e., error bounds for a fixed number of particles, are comparatively less studied,
see, e.g., Del Moral and Miclo [12], Theorem 7.4.4 of Del Moral [10], Cérou, Del Moral and
Guyader [6] and Whiteley [28]. The results in the present paper are based on techniques
which were developed in Eberle and Marinelli [14], [15] for a related continuous-time particle
system and adapted to discrete time in Schweizer [26]. See these papers for more discussion
of the overall approach. 

The vast majority of the Sequential MCMC literature has focused on the case where the MCMC dynamics mixes well for all $\mu_k$. The only precursors of our results on multimodal target distributions appear to be in Eberle and Marinelli [14], [15] who consider the continuous-time case and restrict attention to the case where MCMC mixes well within the elements of a partition of the state space which is fixed for all $\mu_k$. The case of increasingly finer partitions considered in the present paper is more in line with what is observed in typical applications where $\mu_0$ is easy to sample and $\mu_n$ is multimodal.

The idea of reducing complicated multimodal distributions to trees, also known as discon- 
nectivity graphs, has been studied extensively in the chemical physics literature, see Chapter 
5 of Wales [27] for an introduction. Notably, the trees we consider here are only very loosely
related to the genealogical trees of the particle system studied in Cerou, Del Moral and 
Guyader [6]. 

A number of related results for the Simulated Tempering and Parallel Tempering algorithms 
have been proved, in increasing generality in [23l [21 [291 [SO]- Basically, tempering algo- 
rithms differ from Sequential MCMC by substituting the Importance Sampling Resampling 
steps with suitable MCMC steps between the levels $\mu_k$. Technically, these results rely on
decomposition results for bounding spectral gaps of Markov chains, see [H [22l [T7j. These 
decomposition results have the advantage that they do not rely on the assumption that 
the MCMC dynamics does not move between effectively disconnected components which we 
made. Therefore, these results can be applied directly to some simple models of interest such 
as the mean field Ising model. All these results on Tempering algorithms are restricted to 
simple partition structures with global mixing under $\mu_0$ and good mixing within the components of a fixed partition of the state space for $\mu_1, \ldots, \mu_n$. See, e.g., Wales [27] for many
examples from chemical physics which correspond to more general sequences of partitions.

The results of the present paper are extracted from a more detailed presentation in the dissertation [25]. Section 2 introduces the setting, presents the generic error bound for Sequential MCMC from [26] and proves a variation of the latter error bound. Sections 3 and 4 contain our results on Sequential MCMC for multimodal targets as outlined above. Section 5 provides further comparison of the results of Sections 3 and 4 and discusses their implications.

2. Preliminaries 

Section 2.1 introduces the basic notation. Section 2.2 introduces the measure-valued model which is approximated in the algorithm. Section 2.3 introduces the interacting particle system which is simulated in running the algorithm. Section 2.4 collects some basic results found, e.g., in Del Moral and Miclo [12] as well as the generic non-asymptotic error bound from [26] which is applied in the analysis of Section 4. Moreover, we prove a second, related error bound which will be used in Section 3. The setting introduced here covers the models analyzed in Sections 3 and 4 as special cases.

2.1. Notation. Let $(E, r)$ be a complete, separable metric space and let $\mathcal{B}(E)$ be the $\sigma$-algebra of Borel subsets of $E$. Denote by $M(E)$ the space of finite signed Borel measures on $E$. Let $M_1(E) \subset M(E)$ be the subset of all probability measures. Let $B(E)$ be the space of bounded, measurable, real-valued functions on $E$.

For $\mu \in M(E)$ and $f \in B(E)$ define $\mu(f)$ by

$$\mu(f) = \int_E f(x)\,\mu(dx)$$

and $\mathrm{Var}_\mu(f)$ by

$$\mathrm{Var}_\mu(f) = \mu(f^2) - \mu(f)^2.$$

Let $(\tilde E, \tilde r)$ be another Polish space. Consider an integral operator $K(x, A)$ with $K(x, \cdot) \in M(\tilde E)$ for $x \in E$ and $K(\cdot, A) \in B(E)$ for $A \in \mathcal{B}(\tilde E)$. We define for $\mu \in M(E)$ the measure $\mu K \in M(\tilde E)$ by

$$\mu K(A) = \int_E K(x, A)\,\mu(dx) \quad \forall A \in \mathcal{B}(\tilde E).$$

For $f \in B(\tilde E)$ we denote by $K(f) \in B(E)$ the function given by

$$K(f)(x) = K(x, f) = \int_{\tilde E} f(z)\,K(x, dz) \quad \forall x \in E.$$

2.2. The Measure-Valued Model. Consider a sequence of Polish spaces $(E_k, r_k)_{k=0}^n$ and a sequence of probability measures $(\mu_k)_{k=0}^n$, $\mu_k \in M_1(E_k)$. This is the sequence of measures we wish to approximate with the algorithm introduced in Section 2.3. The measures $\mu_k$ are related through

$$\mu_k(f) = \frac{\mu_{k-1}(g_{k-1,k}\,K_k(f))}{\mu_{k-1}(g_{k-1,k})} \quad \forall f \in B(E_k)$$

for positive functions $g_{k-1,k} \in B(E_{k-1})$ and transition kernels $K_k$ with $K_k(x, \cdot) \in M_1(E_k)$ for $x \in E_{k-1}$ and $K_k(\cdot, A) \in B(E_{k-1})$ for $A \in \mathcal{B}(E_k)$. We define the probability distribution $\hat\mu_k \in M_1(E_{k-1})$ by

$$\hat\mu_k(f) = \frac{\mu_{k-1}(g_{k-1,k}\,f)}{\mu_{k-1}(g_{k-1,k})} \quad \forall f \in B(E_{k-1}).$$

This implies $\hat\mu_k(K_k(f)) = \mu_k(f)$ for $f \in B(E_k)$. While we need this slightly more general setting in the analysis of Section 3, the example to have in mind is the one where the state spaces $E_k$ are identical and where $K_k$ encompasses many steps of an MCMC dynamics (e.g., Metropolis) with stationary distribution $\mu_k$. In that case $g_{k-1,k}$ becomes an unnormalized relative density between $\mu_{k-1}$ and $\mu_k$ and we have $\hat\mu_k = \mu_k$.

Next we introduce the Feynman-Kac propagator $q_{j,k}$ which will be the central object of our error analysis. Define the mapping $q_{k-1,k} : B(E_k) \to B(E_{k-1})$ by

$$q_{k-1,k}(f) = \frac{g_{k-1,k}\,K_k(f)}{\mu_{k-1}(g_{k-1,k})}.$$

Observe that this implies

$$\mu_k(f) = \mu_{k-1}(q_{k-1,k}(f)).$$

Furthermore define for $0 \le j < k \le n$ the mapping $q_{j,k} : B(E_k) \to B(E_j)$ by

$$q_{j,k}(f) = q_{j,j+1}(q_{j+1,j+2}(\cdots q_{k-1,k}(f)))$$

and $q_{k,k}(f) = f$. Note that for $f \in B(E_k)$ we have the relation

$$\mu_j(q_{j,k}(f)) = \mu_k(f) \quad \text{for } 0 \le j \le k \le n$$

and the property

$$q_{j,l}(q_{l,k}(f)) = q_{j,k}(f) \quad \text{for } 0 \le j \le l \le k \le n.$$
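The two propagator identities can be checked numerically on a finite state space. The sketch below is only an illustration under stated assumptions: all spaces $E_k$ are taken to be the same three-point set, and the names `integrate`, `apply_kernel`, `next_measure`, `q_step` and `q`, as well as the numerical values of the weights and kernels, are hypothetical choices that do not come from the paper.

```python
def integrate(mu, f):
    """mu(f) = sum over x of f(x) * mu(x)."""
    return sum(m * v for m, v in zip(mu, f))

def apply_kernel(K, f):
    """K(f)(x) = sum over z of K[x][z] * f(z)."""
    return [sum(row[z] * f[z] for z in range(len(f))) for row in K]

def next_measure(mu, g, K):
    """mu_k(y) = sum over x of mu_{k-1}(x) g(x) K(x, y) / mu_{k-1}(g)."""
    norm = integrate(mu, g)
    return [sum(mu[x] * g[x] * K[x][y] for x in range(len(mu))) / norm
            for y in range(len(K[0]))]

def q_step(mu_prev, g, K, f):
    """One-step propagator: g_{k-1,k} * K_k(f) / mu_{k-1}(g_{k-1,k})."""
    norm = integrate(mu_prev, g)
    Kf = apply_kernel(K, f)
    return [g[x] * Kf[x] / norm for x in range(len(g))]

def q(mus, gs, Ks, j, k, f):
    """q_{j,k}(f); gs[i-1] holds g_{i-1,i} and Ks[i-1] holds K_i."""
    for i in range(k, j, -1):
        f = q_step(mus[i - 1], gs[i - 1], Ks[i - 1], f)
    return f

# Hypothetical example with n = 2 on the state space {0, 1, 2}.
gs = [[1.0, 2.0, 0.5], [0.2, 1.0, 3.0]]
Ks = [[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
      [[0.7, 0.3, 0.0], [0.2, 0.6, 0.2], [0.0, 0.1, 0.9]]]
mus = [[0.5, 0.3, 0.2]]
for g, K in zip(gs, Ks):
    mus.append(next_measure(mus[-1], g, K))

f = [1.0, -2.0, 4.0]
lhs = integrate(mus[0], q(mus, gs, Ks, 0, 2, f))   # mu_0(q_{0,2}(f))
rhs = integrate(mus[2], f)                         # mu_2(f)
```

Here `lhs` and `rhs` agree up to floating-point error, and composing `q(..., 0, 1, ...)` with `q(..., 1, 2, ...)` gives the same function as `q(..., 0, 2, ...)`.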
2.3. The Interacting Particle System. In the Sequential MCMC algorithm, we approximate the measures $(\mu_k)_k$ by simulating the interacting particle system introduced in the following. We start with $N$ independent samples $\xi_0 = (\xi_0^1, \ldots, \xi_0^N)$ from $\mu_0$. The particle dynamics alternates two steps, Importance Sampling Resampling and Mutation: A vector of particles $\xi_{k-1}$ approximating $\mu_{k-1}$ is transformed into a vector $\hat\xi_k$ approximating $\hat\mu_k$ by drawing $N$ conditionally independent samples from the empirical distribution of $\xi_{k-1}$ weighted with the functions $g_{k-1,k}$. Afterwards, $\hat\xi_k$ is transformed into a vector $\xi_k$ approximating $\mu_k$ by moving the particles independently with the transition kernel $K_k$.

We thus have two arrays of random variables $(\xi_k^j)_{0 \le k \le n,\, 1 \le j \le N}$ and $(\hat\xi_k^j)_{1 \le k \le n,\, 1 \le j \le N}$ where $\xi_k^j$ takes values in $E_k$ and $\hat\xi_k^j$ takes values in $E_{k-1}$. Denote respectively by $\mathbb{P}[\cdot]$ and $\mathbb{E}[\cdot]$ probabilities and expectations taken with respect to the randomness in the particle system, i.e., with respect to the random variables $\xi_k^j$ and $\hat\xi_k^j$. Denote by $\mathcal{F}_k$ the $\sigma$-algebra generated by $\xi_0, \ldots, \xi_k$ and $\hat\xi_1, \ldots, \hat\xi_k$ and by $\hat{\mathcal{F}}_k$ the $\sigma$-algebra generated by $\xi_0, \ldots, \xi_{k-1}$ and $\hat\xi_1, \ldots, \hat\xi_k$. Denote by $\eta_k^N$ the empirical measure of $\xi_k$, i.e.,

$$\eta_k^N = \frac{1}{N} \sum_{i=1}^N \delta_{\xi_k^i}.$$

The algorithm proceeds as follows:

(i) Draw $\xi_0^1, \ldots, \xi_0^N$ independently from $\mu_0$.

(ii) For $k = 1, \ldots, n$,

(a) draw $\hat\xi_k = (\hat\xi_k^1, \ldots, \hat\xi_k^N)$ according to

$$\mathbb{P}[\hat\xi_k \in dx \mid \mathcal{F}_{k-1}] = \prod_{j=1}^N \sum_{i=1}^N \frac{g_{k-1,k}(\xi_{k-1}^i)}{\sum_{l=1}^N g_{k-1,k}(\xi_{k-1}^l)}\, \delta_{\xi_{k-1}^i}(dx^j),$$

(b) draw $\xi_k = (\xi_k^1, \ldots, \xi_k^N)$ according to

$$\mathbb{P}[\xi_k \in dx \mid \hat{\mathcal{F}}_k] = \prod_{j=1}^N K_k(\hat\xi_k^j, dx^j).$$

(iii) Approximate $\mu_n(f)$ by

$$\eta_n^N(f) = \frac{1}{N} \sum_{i=1}^N f(\xi_n^i).$$

In the following we will study how well $\eta_n^N$ approximates $\mu_n$.
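Steps (i) to (iii) translate almost line by line into code. The following is a minimal sketch under stated assumptions, not the author's implementation: the function name `sequential_mcmc`, the callable-based interface and the two-state example (initial law, weight function and Metropolis kernel) are hypothetical, and multinomial resampling is done with `random.Random.choices` from the Python standard library.

```python
import random

def sequential_mcmc(sample_mu0, gs, kernels, f, N, rng):
    """Sequential MCMC with multinomial resampling.

    sample_mu0(rng) draws from mu_0, gs[k-1] is the weight function g_{k-1,k},
    and kernels[k-1](x, rng) draws one sample from K_k(x, .).
    Returns the estimate eta_n^N(f).
    """
    # (i) N independent samples from mu_0
    particles = [sample_mu0(rng) for _ in range(N)]
    for g, kernel in zip(gs, kernels):
        # (ii)(a) multinomial resampling with weights g_{k-1,k}
        weights = [g(x) for x in particles]
        resampled = rng.choices(particles, weights=weights, k=N)
        # (ii)(b) independent mutation with the kernel K_k
        particles = [kernel(x, rng) for x in resampled]
    # (iii) empirical mean of f under the final particle cloud
    return sum(f(x) for x in particles) / N

# Hypothetical two-state example: mu_0 uniform on {0, 1}, g(x) = 1 + 3x,
# so that mu_1 = (0.2, 0.8); the kernel is a Metropolis step for mu_1.
def metropolis_for_mu1(x, rng):
    if x == 0:
        return 1                                # proposal 1 is always accepted
    return 0 if rng.random() < 0.25 else 1      # acceptance probability 0.2 / 0.8

rng = random.Random(0)
estimate = sequential_mcmc(
    sample_mu0=lambda r: r.randrange(2),
    gs=[lambda x: 1.0 + 3.0 * x],
    kernels=[metropolis_for_mu1],
    f=lambda x: float(x),
    N=20000,
    rng=rng,
)
```

In this toy example the estimate should be close to $\mu_1(f) = 0.8$; the example has a single level and only illustrates the mechanics of the two alternating steps.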

2.4. Non-asymptotic error bounds. We are interested in proving efficient upper bounds for the quantities

$$\mathbb{E}\bigl[|\eta_n^N(f) - \mu_n(f)|^2\bigr]$$

and

$$\mathbb{E}\bigl[|\eta_n^N(f) - \mu_n(f)|\bigr].$$

These quantities can be controlled in terms of the approximation error of a weighted empirical measure $\nu_n^N(f)$ which is easier to handle. We next introduce this measure $\nu_n^N(f)$ and recall an explicit non-asymptotic upper bound on $\mathbb{E}[|\nu_n^N(f) - \mu_n(f)|^2]$.
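Define for $0 \le k \le n$

$$\nu_k^N = \varphi_k\, \eta_k^N,$$

where $\varphi_k$ is given by

$$\varphi_k = \prod_{j=0}^{k-1} \frac{\eta_j^N(g_{j,j+1})}{\mu_j(g_{j,j+1})} \quad \text{for } k \ge 1 \text{ and } \varphi_0 = 1.$$

As shown in Del Moral and Miclo [12], $\nu_k^N(f)$ is an unbiased estimator for $\mu_k(f)$, i.e., $\mathbb{E}[\nu_k^N(f)] = \mu_k(f)$, and we have

$$\mathbb{E}[\nu_k^N(f) \mid \mathcal{F}_{k-1}] = \nu_{k-1}^N(q_{k-1,k}(f)). \tag{1}$$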

The connection between the approximation errors of $\eta_n^N(f)$ and $\nu_n^N(f)$ is established in the following lemma.

Lemma 2.1. For $f \in B(E_n)$ define $f_n = f - \mu_n(f)$ and denote by $\|\cdot\|_{\sup,n}$ the supremum norm on $B(E_n)$. Then we have the bound

$$\mathbb{E}\bigl[|\eta_n^N(f) - \mu_n(f)|^2\bigr] \le 2\,\mathrm{Var}(\nu_n^N(f_n)) + 2\,\|f_n\|_{\sup,n}^2\, \mathrm{Var}(\nu_n^N(1)). \tag{2}$$

See [26] for a proof and a similar bound on the absolute error. Thus we can indeed control the approximation error of $\eta_n^N$ in terms of the approximation error of $\nu_n^N$. We next present the two non-asymptotic error bounds on which the analysis of the later sections relies. These bounds reduce the problem of controlling the particle system to the problem of verifying suitable stability properties of the operators $q_{j,k}$. To this end for $0 \le j \le n$, let $\|\cdot\|_j$ be a norm on the function space $B(E_j)$ such that $\|f\|_j < \infty$ for all $f \in B(E_j)$. Then the central error bound of [26] can be stated as follows:



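Theorem 2.2. For $0 \le j < k \le n$, let $c_{j,k}$ be a constant such that for all $f \in B(E_k)$ the following stability inequality for the propagator $q_{j,k}$ is satisfied:

$$\max\bigl( \|1\|_j\, \|q_{j,k}(f)^2\|_j,\ \|q_{j,k}(f)\|_j^2,\ \ldots \bigr) \le c_{j,k}\, \|f\|_k^2. \tag{3}$$

Define $\hat c_k$, $v_k$ and $\varepsilon_k^N$ by

$$\hat c_k = \sum_{j=0}^{k-1} [\,\cdots\,],$$

by

$$v_k = \sup\Bigl\{ \sum_{j=0}^{k} \mathrm{Var}_{\mu_j}(q_{j,k}(f)) \,\Bigm|\, f \in B(E_k),\ \|f\|_k \le 1 \Bigr\}$$

and by

$$\varepsilon_k^N = \sup\Bigl\{ \mathbb{E}\bigl[|\nu_k^N(f) - \mu_k(f)|^2\bigr] \,\Bigm|\, f \in B(E_k),\ \|f\|_k \le 1 \Bigr\}.$$

Furthermore define

$$\bar c_k = \max_{j \le k} \hat c_j, \quad \bar v_k = \max_{j \le k} v_j \quad \text{and} \quad \bar\varepsilon_k^N = \max_{j \le k} \varepsilon_j^N.$$

Then for all $f \in B(E_n)$ we have

$$N\, \mathbb{E}\bigl[|\nu_n^N(f) - \mu_n(f)|^2\bigr] \le \sum_{j=0}^{n} \mathrm{Var}_{\mu_j}(q_{j,n}(f)) + \hat c_n\, \|f\|_n^2\, \bar\varepsilon_n^N \tag{4}$$

and, if $N \ge 2 \bar c_n$,

$$\bar\varepsilon_n^N \le \frac{2\, \bar v_n}{N}. \tag{5}$$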

For the analysis of Sequential MCMC on trees we rely on a variation of Theorem 2.2 which
is stated and proved next. The basic difference between the two theorems is that they rely
on different expressions for the variance of $\nu_n^N(f)$.

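Theorem 2.3. For $0 \le j \le k \le n$, let $d_{j,k}$ be a constant such that for all $f \in B(E_k)$ the following stability inequality for the propagator $q_{j,k}$ is satisfied:

$$\max\bigl( \|1\|_j\, \|q_{j,k}(f)^2\|_j,\ \|q_{j,k}(f)\|_j^2 \bigr) \le d_{j,k}\, \|f\|_k^2. \tag{6}$$

Define $v_k$, $\bar v_k$, $\varepsilon_k^N$ and $\bar\varepsilon_k^N$ as in Theorem 2.2 and let

$$\hat d_k = 2 \sum_{j=0}^{k} d_{j,k}$$

and $\bar d_k = \max_{j \le k} \hat d_j$. Then for all $f \in B(E_n)$ we have

$$N\, \mathbb{E}\bigl[|\nu_n^N(f) - \mu_n(f)|^2\bigr] \le \sum_{j=0}^{n} \mathrm{Var}_{\mu_j}(q_{j,n}(f)) + \hat d_n\, \|f\|_n^2\, \bar\varepsilon_n^N \tag{7}$$

and, if $N \ge 2 \bar d_n$,

$$\bar\varepsilon_n^N \le \frac{2\, \bar v_n}{N}. \tag{8}$$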



The main difference between the crucial conditions (3) and (6) in the two bounds is that (6) includes the case $j = k$. This implies that we need a constant which allows us to bound $\|f^2\|_k$ against $\|f\|_k^2$. This is generally difficult since, unlike in the cases $j < k$, there are no transition kernels $K_l$ on the left hand side whose smoothing properties may be exploited. An important exception is the case where $\|\cdot\|_k$ is a supremum norm since in that case we have $\|f^2\|_k = \|f\|_k^2$. In the setting of Sequential MCMC on trees analyzed in Section 3 we indeed rely on supremum norms and thus obtain better constants from Theorem 2.3.



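Proof of Theorem 2.3. We can write the variance of $\nu_n^N(f)$ as follows:

$$\mathbb{E}\bigl[|\nu_n^N(f) - \mu_n(f)|^2\bigr] = \frac{1}{N}\, \mathbb{E}\bigl[\nu_n^N(1)\,\nu_n^N(f^2) - \mu_n(f)^2\bigr] + \frac{1}{N} \sum_{j=0}^{n-1} \mathbb{E}\bigl[U_{j,n}^N(f)\bigr], \tag{9}$$

where

$$U_{j,n}^N(f) = \nu_j^N(1)\,\nu_j^N(q_{j,n}(f)^2) - \nu_j^N(q_{j,n}(f))^2. \tag{10}$$

This result goes back to Del Moral and Miclo [12]; for a quick verification combine (10) and (11) in [26]. Now note that by the Cauchy-Schwarz inequality and since $\nu_j^N(\cdot)$ is an unbiased estimator for $\mu_j(\cdot)$ we have for any $g, h \in B(E_j)$

$$\bigl|\mathbb{E}[\nu_j^N(g)\,\nu_j^N(h) - \mu_j(g)\,\mu_j(h)]\bigr| \le \bigl|\mu_j(g)\,\mathbb{E}[\nu_j^N(h) - \mu_j(h)] + \mu_j(h)\,\mathbb{E}[\nu_j^N(g) - \mu_j(g)]\bigr| + \mathbb{E}\bigl[|\nu_j^N(g) - \mu_j(g)|\,|\nu_j^N(h) - \mu_j(h)|\bigr] \le \|g\|_j\, \|h\|_j\, \varepsilon_j^N. \tag{11}$$

Thus adding $\pm \mathrm{Var}_{\mu_j}(q_{j,n}(f))$ to the definition (10) of $U_{j,n}^N(f)$ and applying (11) twice yields

$$\mathbb{E}\bigl[U_{j,n}^N(f)\bigr] \le \mathrm{Var}_{\mu_j}(q_{j,n}(f)) + S_{j,n}(f)\, \varepsilon_j^N,$$

where

$$S_{j,n}(f) = \|1\|_j\, \|q_{j,n}(f)^2\|_j + \|q_{j,n}(f)\|_j^2.$$

Applying (6) then yields for $0 \le j < n$ the estimate

$$\mathbb{E}\bigl[U_{j,n}^N(f)\bigr] \le \mathrm{Var}_{\mu_j}(q_{j,n}(f)) + 2\, d_{j,n}\, \|f\|_n^2\, \varepsilon_j^N. \tag{12}$$

A parallel argument yields

$$\mathbb{E}\bigl[\nu_n^N(1)\,\nu_n^N(f^2) - \mu_n(f)^2\bigr] \le \mathrm{Var}_{\mu_n}(f) + S_{n,n}(f)\, \varepsilon_n^N,$$

where $S_{n,n}(f) = \|1\|_n\, \|f^2\|_n$, and thus by (6)

$$\mathbb{E}\bigl[\nu_n^N(1)\,\nu_n^N(f^2) - \mu_n(f)^2\bigr] \le \mathrm{Var}_{\mu_n}(f) + d_{n,n}\, \|f\|_n^2\, \varepsilon_n^N. \tag{13}$$

With these observations we are prepared to bound the quadratic approximation error of $\nu_n^N(f)$: Bounding (9) using (12) and (13) we obtain

$$N\, \mathbb{E}\bigl[|\nu_n^N(f) - \mu_n(f)|^2\bigr] \le \sum_{j=0}^{n} \mathrm{Var}_{\mu_j}(q_{j,n}(f)) + 2 \sum_{j=0}^{n} d_{j,n}\, \|f\|_n^2\, \varepsilon_j^N.$$

Bounding $\varepsilon_j^N$ by $\bar\varepsilon_n^N$ and inserting the definition of $\hat d_n$ shows (7). Optimizing (7) over $f$ with $\|f\|_n \le 1$ and over $n$ yields

$$N\, \bar\varepsilon_n^N \le \bar v_n + \bar d_n\, \bar\varepsilon_n^N.$$

Choosing $N \ge 2 \bar d_n$ and thus $N - \bar d_n \ge N/2$ gives (8). □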



In both Theorems 2.2 and 2.3, the coefficient of the leading term in the error bound corresponds to the asymptotic variance in the central limit theorem for $\nu_n^N(f)$ found in Del Moral
and Miclo ([12], p. 45).



3. Sequential MCMC on Trees 

In this section we study the ability of our Sequential MCMC algorithm to explore a mul- 
timodal state space by abstracting from the problem of mixing within modes: We consider 
the algorithm on a simple tree structure. We assume that our sequence of probability distributions $(\mu_k)_k$ lives on a sequence of state spaces $(I_k)_k$ where the states in $I_{k+1}$ have unique predecessors in $I_k$. Particle movements in the MCMC steps are restricted to moving from a state in $I_k$ to one of its successors in $I_{k+1}$.



Section 3.1 introduces the model including the notation for the tree structure. Section 3.2 states the algorithm and the error bounds for this setting. While the algorithm considered here should be viewed as a stylized version of Sequential MCMC which abstracts from problems of local mixing, it nevertheless fits into the framework of Section 2. Section 3.3 introduces an alternative algorithm, Sequential Importance Sampling, which is based on weighting particles instead of resampling them. In Section 3.4 we provide an extensive discussion of an elementary example where the error of Sequential MCMC grows polynomially in the number of levels $n$ while the error of Sequential Importance Sampling increases exponentially fast.
3.1. The Model. Consider a sequence of probability distributions $\mu_0, \ldots, \mu_n$ on a sequence of finite state spaces $I_0, \ldots, I_n$. Assume that each $\mu_k$ gives positive mass to each point in its state space $I_k$. Denote by $B(I_k)$ the bounded measurable functions from $I_k$ to $\mathbb{R}$. We define a tree structure on the sequence of state spaces by introducing for $k \in \{0, \ldots, n-1\}$ the predecessor function $p_k : I_{k+1} \cup \ldots \cup I_n \to I_k$ which maps $x \in I_l$ to its predecessor in $I_k$ for $l > k$. We assume transitivity of the functions $p_k$, i.e., for $j < k < l$ and $x \in I_l$ we assume that

$$p_j(p_k(x)) = p_j(x).$$

Denote by $\mathcal{P}(I_k)$ the collection of subsets of $I_k$. Conversely to $p_k$, we define the successor function $s_k : I_0 \cup \ldots \cup I_{k-1} \to \mathcal{P}(I_k)$ as follows: For $x \in I_l$ with $0 \le l < k \le n$ the successors in $I_k$ of $x$ are given by

$$s_k(x) = \{ y \in I_k \mid p_l(y) = x \}.$$

We assume that no branches die out, i.e., for all $x \in I_0 \cup \ldots \cup I_{n-1}$

$$s_n(x) \ne \emptyset.$$

With the additional assumption $|I_0| = 1$ we would obtain a genuine tree structure, yet this is not needed in the following. For a probability distribution $\mu$ on $I_k$ and $l < k$, define the probability distribution $\mu^{l}$ on $I_l$ as the projection of $\mu$ to $I_l$: For $x \in I_l$,

$$\mu^{l}(x) = \mu(s_k(x)).$$

For $0 \le k < n$, denote by $g_{k,k+1} \in B(I_k)$ an unnormalized relative density between $\mu_k$ and $\mu_{k+1}^{k}$: For all $f \in B(I_k)$

$$\mu_{k+1}^{k}(f) = \frac{\mu_k(f\, g_{k,k+1})}{\mu_k(g_{k,k+1})}.$$

Denote by $K_{k+1} : I_k \times I_{k+1} \to [0, 1]$ a Markov transition kernel for which

$$\mu_{k+1}(f) = \mu_{k+1}^{k}(K_{k+1}(f))$$

for all $f \in B(I_{k+1})$. Any pair of probability distributions $\mu_k$ and $\mu_{k+1}$ with full support on, respectively, $I_k$ and $I_{k+1}$ can be related through a pair $(g_{k,k+1}, K_{k+1})$. Moreover $K_{k+1}$ is unique and $g_{k,k+1}$ is unique up to a normalizing constant. For $x \in I_k$ and $y \in I_{k+1}$, $K_{k+1}$ is given explicitly by

$$K_{k+1}(x, y) = \begin{cases} \dfrac{\mu_{k+1}(y)}{\mu_{k+1}^{k}(x)}, & \text{if } y \in s_{k+1}(x), \\ 0, & \text{otherwise.} \end{cases}$$

The tree structure, concretely, the fact that the states in $I_k$ are not connected by $K_k$, is a simple model of a multimodal state space: The elements of $I_k$ stand for components of a continuous state space which are separated by regions of very low probability. For the particle dynamics we consider subsequently, the consequence is that particles can move between different branches only through the resampling step but not through the mutation step. This is consistent with our aim of studying how helpful the resampling step is in overcoming problems associated with multimodality.

This model is a special case of the framework of Section 2.2. The following lemma gives an explicit expression for $q_{j,k}(f)$.

Lemma 3.1. For $0 \le j < k \le n$, $f \in B(I_k)$ and $x \in I_j$ we have

$$q_{j,k}(f)(x) = \frac{\mu_k(f\, 1_{s_k(x)})}{\mu_j(x)}. \tag{14}$$

In particular,

$$|q_{j,k}(f)(x)| \le \Bigl( \max_{y \in I_k} |f(y)| \Bigr)\, q_{j,k}(1)(x) \le \Bigl( \max_{y \in I_k} |f(y)| \Bigr) \max_{z \in I_j} \frac{\mu_k^{j}(z)}{\mu_j(z)}. \tag{15}$$

Proof of Lemma 3.1. Observe that for $x \in I_j$ and $f \in B(I_k)$ we have for $x \ne y \in I_j$

$$q_{j,k}(f\, 1_{s_k(y)})(x) = 0.$$

Thus we can write

$$q_{j,k}(f)(x) = \sum_{y \in I_j} q_{j,k}(f\, 1_{s_k(y)})(x) = q_{j,k}(f\, 1_{s_k(x)})(x),$$

since $q_{j,k}(f)$ is linear in $f$. Therefore we have

$$\mu_k(f\, 1_{s_k(x)}) = \mu_j(q_{j,k}(f\, 1_{s_k(x)})) = \mu_j(x)\, q_{j,k}(f\, 1_{s_k(x)})(x) = \mu_j(x)\, q_{j,k}(f)(x),$$

which can be rearranged into (14). (15) follows from

$$|q_{j,k}(f\, 1_{s_k(x)})(x)| \le \Bigl( \max_{y \in I_k} |f(y)| \Bigr)\, q_{j,k}(1_{s_k(x)})(x) = \Bigl( \max_{y \in I_k} |f(y)| \Bigr) \frac{\mu_k(s_k(x))}{\mu_j(x)}$$

and the definition of $\mu_k^{j}$. □
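Formula (14) makes the propagator on a tree directly computable. The sketch below illustrates this on a hypothetical three-level tree; the representation of the tree through dictionaries `parents[l]` (mapping each node of level $l+1$ to its parent in level $l$), the node labels and the probability values are assumptions made for the example only.

```python
def ancestor(parents, x, k, j):
    """Predecessor in level j of a node x at level k (for j <= k)."""
    for level in range(k - 1, j - 1, -1):
        x = parents[level][x]
    return x

def q_tree(mus, parents, j, k, f):
    """q_{j,k}(f)(x) = mu_k(f 1_{s_k(x)}) / mu_j(x) for x in I_j, as in (14)."""
    mass = {x: 0.0 for x in mus[j]}
    for y, weight in mus[k].items():
        mass[ancestor(parents, y, k, j)] += f[y] * weight
    return {x: mass[x] / mus[j][x] for x in mass}

# Hypothetical tree: root r, children a and b; a splits into a1, a2; b has one child b1.
parents = [{"a": "r", "b": "r"}, {"a1": "a", "a2": "a", "b1": "b"}]
mus = [
    {"r": 1.0},
    {"a": 0.9, "b": 0.1},
    {"a1": 0.3, "a2": 0.2, "b1": 0.5},
]

f = {"a1": 1.0, "a2": -1.0, "b1": 2.0}
q12 = q_tree(mus, parents, 1, 2, f)
q12_one = q_tree(mus, parents, 1, 2, {y: 1.0 for y in mus[2]})
```

In this example the node `b` carries mass 0.1 at level 1 while its successor carries mass 0.5 at level 2, so `q12_one["b"]` equals 5: this is the kind of importance gain that the constants of the next subsection measure.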



3.2. Sequential MCMC. We next apply to our model the error bounds of Theorem 2.3. To achieve this we need to define a series of norms $\|\cdot\|_j$ on $B(I_j)$ and find constants $d_{j,k}$ such that the inequality

$$\max\bigl( \|1\|_j\, \|q_{j,k}(f)^2\|_j,\ \|q_{j,k}(f)\|_j^2 \bigr) \le d_{j,k}\, \|f\|_k^2 \tag{16}$$

is satisfied. We choose $\|\cdot\|_j$ to be the maximum norm on $B(I_j)$, i.e., for $f \in B(I_j)$

$$\|f\|_j = \max_{x \in I_j} |f(x)|.$$

The following proposition gives a choice of constants $d_{j,k}$ which guarantees that (16) is satisfied and shows that these constants can also be used to bound the remaining quantities arising in the error bound of Theorem 2.3.



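Proposition 3.2. For all $j \le k$, inequality (16) is satisfied for the constants

$$d_{j,k} = \max_{x \in I_j} \Bigl( \frac{\mu_k^{j}(x)}{\mu_j(x)} \Bigr)^2.$$

Moreover for all $f \in B(I_k)$

$$\mathrm{Var}_{\mu_j}(q_{j,k}(f)) \le \sqrt{d_{j,k}}\, \|f\|_k^2, \tag{17}$$

as well as

$$v_k \le \sum_{j=0}^{k} \sqrt{d_{j,k}} \quad \text{and} \quad \bar v_k \le \max_{l \le k} \sum_{j=0}^{l} \sqrt{d_{j,l}}.$$

Proof of Proposition 3.2. Observe that we have $\|f^2\|_j = \|f\|_j^2$, $\|1\|_j = 1$ and by Lemma 3.1

$$\|q_{j,k}(f)\|_j \le \|f\|_k\, \|q_{j,k}(1)\|_j.$$

Moreover by the same lemma we have

$$\|q_{j,k}(1)\|_j = \max_{x \in I_j} \frac{\mu_k^{j}(x)}{\mu_j(x)}. \tag{18}$$

Thus (16) is satisfied for the constants $d_{j,k}$ we defined. For the bound on $\mathrm{Var}_{\mu_j}(q_{j,k}(f))$ note that

$$\mathrm{Var}_{\mu_j}(q_{j,k}(f)) \le \mu_j(q_{j,k}(f)^2) \le \|f\|_k^2\, \|q_{j,k}(1)\|_j\, \mu_j(q_{j,k}(1)) \le \sqrt{d_{j,k}}\, \|f\|_k^2.$$

This immediately implies the bounds on $v_k$ and $\bar v_k$. □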



In order to apply the error bound of Theorem 2.3 it is sufficient to control the constants $d_{j,k}$ defined in the proposition. $d_{j,k}$ is large when a node at level $j$, which carries little mass, has offspring at level $k$, which (in sum) carries considerably more probability mass. Notably, the constant $d_{j,k}$ does not take into account any further branching of the state space which occurs at levels $j+1, \ldots, n$. We next put these bounds into perspective by deriving a lower bound on the asymptotic variance

$$\mathrm{Var}_k^{as}(f) = \sum_{j=0}^{k} \mathrm{Var}_{\mu_j}(q_{j,k}(f))$$

for the test function $f \equiv 1 \in B(I_k)$.
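Proposition 3.3.

$$\mathrm{Var}_k^{as}(1) = \sum_{j=0}^{k} \sum_{x \in I_j} \mu_k^{j}(x) \Bigl( \frac{\mu_k^{j}(x)}{\mu_j(x)} - 1 \Bigr) = \sum_{j=0}^{k} \mu_k^{j}\bigl(q_{j,k}(1) - 1\bigr).$$

Proof of Proposition 3.3. Observe that

$$\mathrm{Var}_{\mu_j}(q_{j,k}(1)) = \mu_j(q_{j,k}(1)^2) - \mu_j(q_{j,k}(1))^2.$$

By Lemma 3.1 we have

$$\mu_j(q_{j,k}(1)^2) = \sum_{x \in I_j} \mu_j(x) \Bigl( \frac{\mu_k^{j}(x)}{\mu_j(x)} \Bigr)^2 = \sum_{x \in I_j} \mu_k^{j}(x)\, \frac{\mu_k^{j}(x)}{\mu_j(x)}.$$

By the fact that

$$\mu_j(q_{j,k}(1))^2 = \mu_k(1)^2 = 1 = \sum_{x \in I_j} \mu_k^{j}(x)$$

we can thus write

$$\mathrm{Var}_{\mu_j}(q_{j,k}(1)) = \sum_{x \in I_j} \mu_k^{j}(x) \Bigl( \frac{\mu_k^{j}(x)}{\mu_j(x)} - 1 \Bigr) = \mu_k^{j}\bigl(q_{j,k}(1) - 1\bigr).$$

Summing over $j$ completes the proof. □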
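The constants of Propositions 3.2 and 3.3 are easy to evaluate on a small tree. The sketch below is an illustration only: the tree, the probability values and the helper names `project`, `d_const` and `v_const` are hypothetical, and the tree is again encoded by dictionaries `parents[l]` mapping nodes of level $l+1$ to their parents in level $l$.

```python
import math

def ancestor(parents, x, k, j):
    """Predecessor in level j of a node x at level k (for j <= k)."""
    for level in range(k - 1, j - 1, -1):
        x = parents[level][x]
    return x

def project(mus, parents, k, j):
    """Projection of mu_k to level j: mass of the level-k successors of each x in I_j."""
    out = {x: 0.0 for x in mus[j]}
    for y, weight in mus[k].items():
        out[ancestor(parents, y, k, j)] += weight
    return out

def d_const(mus, parents, j, k):
    """Constant of Proposition 3.2: squared maximal ratio of projected mass to mu_j."""
    proj = project(mus, parents, k, j)
    return max((proj[x] / mus[j][x]) ** 2 for x in mus[j])

def v_const(mus, parents, j, k):
    """Var_{mu_j}(q_{j,k}(1)) written as in Proposition 3.3."""
    proj = project(mus, parents, k, j)
    return sum(proj[x] * (proj[x] / mus[j][x] - 1.0) for x in mus[j])

# Hypothetical tree: root r, children a and b; a splits into a1, a2; b has one child b1.
parents = [{"a": "r", "b": "r"}, {"a1": "a", "a2": "a", "b1": "b"}]
mus = [
    {"r": 1.0},
    {"a": 0.9, "b": 0.1},
    {"a1": 0.3, "a2": 0.2, "b1": 0.5},
]

k = 2
asymptotic_variance = sum(v_const(mus, parents, j, k) for j in range(k + 1))
sqrt_d_sum = sum(math.sqrt(d_const(mus, parents, j, k)) for j in range(k + 1))
```

In this example only level 1 contributes to the asymptotic variance of the constant test function, because the root and level 2 itself have ratio one everywhere; the sum of the values $\sqrt{d_{j,k}}$ dominates it, in line with (17).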
Denote the expression for $\mathrm{Var}_{\mu_j}(q_{j,k}(1))$ from the proposition by $v_{j,k}$, i.e.,

$$v_{j,k} = \sum_{x \in I_j} \mu_k^{j}(x) \Bigl( \frac{\mu_k^{j}(x)}{\mu_j(x)} - 1 \Bigr).$$

$d_{j,k}$ may be large even when $v_{j,k}$ is small: $d_{j,k}$ is large if the successors at level $k$ of $x \in I_j$ are - relatively - much more important under $\mu_k$ than $x$ is under $\mu_j$. In this case $v_{j,k}$ may still be small if the absolute importance of the successors of $x$ is small under $\mu_k$. In short, $v_{j,k}$ may be much smaller than $d_{j,k}$ if the largest (relative) gains in importance are made by regions of the state space that remain (absolutely) unimportant.



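As a by-product, note that from the proof of Proposition 3.3 we immediately get an upper bound on $\mathrm{Var}_{\mu_j}(q_{j,k}(f))$ which is sharper than (17):

$$\mathrm{Var}_{\mu_j}(q_{j,k}(f)) \le \mu_j(q_{j,k}(1)^2)\, \|f\|_k^2 = \tilde d_{j,k}\, \|f\|_k^2,$$

where $\tilde d_{j,k}$ is defined as

$$\tilde d_{j,k} = \sum_{x \in I_j} \frac{\mu_k^{j}(x)^2}{\mu_j(x)}.$$

We obtain corresponding sharper upper bounds on $v_k$ and $\bar v_k$. This allows us to bound the leading term in the error bounds of Theorem 2.3 using $\tilde d_{j,k}$ in place of $\sqrt{d_{j,k}}$.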



3.3. Sequential Importance Sampling. For the purpose of comparison, we next introduce 
a Sequential Importance Sampling algorithm for the tree model and give an explicit expression 
of its approximation error for a class of test functions. 

In Sequential Importance Sampling, particles are moved independently according to the kernels $K_k$. Afterwards, importance weights $\omega$ are calculated for the particles which make it possible to obtain an estimator for $\mu_n$ through a weighted empirical measure of the particles. In the present framework, Sequential Importance Sampling is equivalent to simple Importance Sampling between the probability distribution $\pi_n$ on $I_n$ given by

$$\pi_n = \mu_0 K_1 K_2 \cdots K_n$$

and $\mu_n$. For simplicity, we consider only unnormalized Importance Sampling, i.e., we assume that we can calculate the weights exactly (and not only up to a normalizing constant). This has the advantage that we do not have to consider a bias introduced by normalizing the particle weights through their sum. Otherwise, the algorithm corresponds to, e.g., the Annealed Importance Sampling algorithm of Neal [24].

Instead of a system of particles, it is thus sufficient to consider only the vector of particles $(\tilde\xi_n^i)_{1 \le i \le N}$ which are distributed independently according to $\pi_n$. We define the importance weight function $\omega_n \in B(I_n)$ by

$$\omega_n(x) = \frac{\mu_n(x)}{\pi_n(x)} \quad \text{for all } x \in I_n.$$

Then for $f \in B(I_n)$ the Sequential Importance Sampling estimator $\tilde\eta_n(f)$ is given by

$$\tilde\eta_n(f) = \frac{1}{N} \sum_{i=1}^{N} \omega_n(\tilde\xi_n^i)\, f(\tilde\xi_n^i).$$
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Vnif) is an unbiased estimator for Hn{f), i-e., 

We next calculate a formula for the quadratic approximation error for test functions of the 
form / = 



we have 



Lemma 3.4. For x ^ In and f 

E[\Uf)-Mf)f 



N 



7r„(x) 



1 



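Proof of Lemma 3.4. To prove the lemma we only need the following calculation based on the unbiasedness of $\eta_n(f)$ and on $\mu_n(f) = \mu_n(x)$:

\[
\begin{aligned}
E\big[|\eta_n(f) - \mu_n(f)|^2\big]
&= E\left[ \left| \frac{1}{N} \sum_{i=1}^{N} \big( \omega_n(\xi_i) f(\xi_i) - \mu_n(x) \big) \right|^2 \right] \\
&= \frac{1}{N^2} \sum_{i=1}^{N} E\Big[ \big| \omega_n(\xi_i) f(\xi_i) - \mu_n(x) \big|^2 \Big] \\
&= \frac{1}{N} \left( \pi_n(x) \left( \frac{\mu_n(x)}{\pi_n(x)} - \mu_n(x) \right)^2 + (1 - \pi_n(x))\, \mu_n(x)^2 \right) \\
&= \frac{\mu_n(x)^2}{N} \left( \frac{1}{\pi_n(x)} - 1 \right).
\end{aligned}
\]

□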



We thus see that Sequential Importance Sampling can only perform well if the distribution $\pi_n$ is sufficiently close to $\mu_n$, more precisely, if no state which is unimportant under $\pi_n$ is important under $\mu_n$.
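The error formula of Lemma 3.4 can be checked against a direct variance computation on a small finite state space. The following is a minimal sketch, not part of the original analysis: the function names and the toy distributions `MU` and `PI` are hypothetical, and exact rational arithmetic is used so that both expressions can be compared for equality.

```python
from fractions import Fraction

# Hypothetical toy target and proposal on a three-point state space.
MU = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
PI = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]


def sis_mse_direct(mu, pi, x, n_particles):
    """Variance of the unnormalized estimator for f = 1_{x}, from the definition.

    One weighted sample contributes omega(xi) * f(xi) with xi ~ pi, and the
    estimator averages n_particles independent copies of it.
    """
    second_moment = sum(
        pi[y] * ((mu[y] / pi[y]) * (1 if y == x else 0)) ** 2
        for y in range(len(pi))
    )
    return (second_moment - mu[x] ** 2) / n_particles


def sis_mse_lemma(mu, pi, x, n_particles):
    """Closed form of Lemma 3.4: mu(x)^2 / N * (1 / pi(x) - 1)."""
    return mu[x] ** 2 / n_particles * (1 / pi[x] - 1)
```

Both functions agree for every state of the toy example; the error is largest for the state with large target mass and small proposal mass.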

3.4. Example: Weighting or Resampling? We now apply the error bounds we just developed to a concrete example depicted in Figure 1. Our aim is to show that in this case Sequential MCMC, notably its Resampling step, succeeds in a multimodal setting in which Sequential Importance Sampling severely suffers from weight degeneracy. Section 3.4.1 introduces the setting of the example. Section 3.4.2 derives upper bounds on $q_{j,k}(1)$. Sections 3.4.3 and 3.4.4 contain the error analysis for, respectively, Sequential MCMC and Sequential Importance Sampling. Section 3.4.5 closes our comparison of Sequential MCMC and Sequential Importance Sampling by discussing some further examples.

3.4.1. The Model. We consider the sequence of state spaces $I_0, \ldots, I_n$ given by

\[ I_k = \{0_k, \ldots, k_k\}. \]

Thus the elements of $I_k$ are the natural numbers from 0 to $k$, indexed by $k$ in order to keep the notation clearer. For $l > k$, the predecessor in $I_k$ of $j_l \in I_l$ is given by $j_k$ if $j < k$, otherwise it is $k_k$:

\[ p_k(j_l) = \begin{cases} j_k, & \text{if } j < k, \\ k_k, & \text{if } j \ge k. \end{cases} \]
Figure 1. Weighting or Resampling?



We thus have a simple tree structure where from level $k$ to level $k+1$ the "largest" node $k_k$ has two successors, $k_{k+1}$ and $(k+1)_{k+1}$, while all other nodes $j_k$ have only one successor $j_{k+1}$. Accordingly, for $l > k$ and $j_k \in I_k$, the successor function is given by

\[ s_l(j_k) = \begin{cases} \{j_l\}, & \text{if } j < k, \\ \{k_l, \ldots, l_l\}, & \text{if } j = k. \end{cases} \]

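We define the sequence $\mu_0, \ldots, \mu_n$ implicitly through $g_{k,k+1}$ and $K_{k+1}$. We choose the weight function $g_{k,k+1} \in B(I_k)$ such that only the mass of $k_k$ is modified while the relative masses of the other nodes remain the same:

\[ g_{k,k+1}(j_k) = \begin{cases} 1, & \text{if } j < k, \\ 2\theta \text{ with } \theta > 0, & \text{if } j = k. \end{cases} \]

The transition kernel $K_{k+1} : I_k \times I_{k+1} \to [0,1]$ is chosen such that $K_{k+1}(j_k, \cdot)$ is the uniform distribution on the successors of $j_k$:

\[ K_{k+1}(j_k, i_{k+1}) = \begin{cases} 1, & \text{if } i = j < k, \\ \tfrac{1}{2}, & \text{if } j = k \text{ and } i \in \{k, k+1\}, \\ 0, & \text{otherwise.} \end{cases} \]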



Observe that for $\theta > \tfrac{1}{2}$ we have two countervailing effects: one from the kernels $K_k$ and one from the functions $g_{k,k+1}$. On the one hand, the kernels $K_k$ favor that mass is concentrated on $j_k$ with small $j$. If we had a constant function $g_{k,k+1}$ (i.e. $\theta = \tfrac{1}{2}$), $\mu_k$ would be a geometric distribution with parameter $\tfrac{1}{2}$ and maximum in $0_k$. On the other hand, the weight functions $g_{k,k+1}$ move mass to the largest node $k_k$. As becomes clear from the explicit formula for $\mu_k$ calculated next, the case of $\theta > 1$ which we mainly consider is the case where the second effect is sufficiently strong in the sense that $\mu_k(k_k) > \mu_k(j_k)$ for $j < k-1$. As $\theta$ approaches 1, $\mu_k$ converges to the uniform distribution on $I_k$. The cases where $\theta < 1$ are largely omitted in our error bounds, not because they are more difficult, but because they are less interesting and would need a largely separate analysis.



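Corollary 3.5. For $j_k \in I_k$ we have

\[ \mu_k(j_k) = \begin{cases} \theta^{j+1} / Z_k, & \text{if } j < k, \\ \theta^{k} / Z_k, & \text{if } j = k, \end{cases} \]

where the normalizing constant $Z_k$ is given by

\[ Z_k = \theta^k + \sum_{j=0}^{k-1} \theta^{j+1}. \tag{19} \]

Moreover for $\theta \ne 1$,

\[ Z_k = \theta^k + \frac{\theta}{\theta - 1} \left( \theta^k - 1 \right). \tag{20} \]

The corollary is an immediate consequence of our choices of $g_{k,k+1}$ and $K_{k+1}$. Thus for $\theta > 1$, $\mu_k$ can be characterized as follows: it is a geometric distribution with maximum in $(k-1)_k$ on $0_k, \ldots, (k-1)_k$. Additionally we have $\mu_k((k-1)_k) = \mu_k(k_k)$.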
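The closed form of Corollary 3.5 can be cross-checked by propagating mass through the tree according to $\mu_{k+1} \propto \mu_k \, g_{k,k+1} K_{k+1}$. The sketch below is an illustration only: the function names are hypothetical, and the closed form it compares against is the reading $\mu_k(j_k) \propto \theta^{j+1}$ for $j < k$ and $\mu_k(k_k) \propto \theta^k$, which is treated here as an assumption to be verified rather than taken from the paper's own code.

```python
from fractions import Fraction


def tree_measures(theta, n):
    """Return [mu_0, ..., mu_n]; mu_k is a list indexed by j = 0, ..., k."""
    theta = Fraction(theta)
    gamma = [Fraction(1)]          # unnormalized mass on I_0 = {0_0}
    measures = [[Fraction(1)]]
    for k in range(n):
        nxt = [Fraction(0)] * (k + 2)
        for j, mass in enumerate(gamma):
            if j < k:
                # weight g = 1 and a single successor j_{k+1}
                nxt[j] += mass
            else:
                # j == k: weight g = 2 * theta, split uniformly on two successors
                nxt[k] += mass * 2 * theta / 2
                nxt[k + 1] += mass * 2 * theta / 2
        gamma = nxt
        total = sum(gamma)
        measures.append([m / total for m in gamma])
    return measures


def closed_form(theta, k):
    """mu_k from the closed form: weights theta^(j+1) for j < k and theta^k for j = k."""
    theta = Fraction(theta)
    weights = [theta ** (j + 1) for j in range(k)] + [theta ** k]
    z_k = sum(weights)             # normalizing constant Z_k as in (19)
    return [w / z_k for w in weights]
```

For instance, `tree_measures(3, 2)[2]` and `closed_form(3, 2)` both give the masses 1/7, 3/7, 3/7, which also illustrates $\mu_k((k-1)_k) = \mu_k(k_k)$.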



3.4.2. Controlling $q_{j,k}$. From here on we mostly focus on the case $\theta \ge 1$. In order to apply the error bounds of Section 3.2 we have to study the expressions $q_{j,k}(1)$ for this example. This is begun in the following lemma.

Lemma 3.6. For $0 \le k < l \le n$, we have

\[ q_{k,l}(1)(j_k) = \begin{cases} \dfrac{Z_k}{Z_l}, & \text{if } j < k, \\[2mm] \dfrac{Z_k Z_{l-k}}{Z_l}, & \text{if } j = k. \end{cases} \]

Furthermore for $\theta \ge 1$,

\[ \max_{j_k \in I_k} q_{k,l}(1)(j_k) = \frac{Z_k Z_{l-k}}{Z_l}. \]



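Proof of Lemma 3.6. Recall from Lemma 3.1 that

\[ q_{k,l}(1)(j_k) = \frac{\mu_l(s_l(j_k))}{\mu_k(j_k)}. \]

Thus for $j_k \ne k_k$ Corollary 3.5 immediately implies

\[ q_{k,l}(1)(j_k) = \frac{\mu_l(j_l)}{\mu_k(j_k)} = \frac{Z_k}{Z_l}. \]

Furthermore,

\[ q_{k,l}(1)(k_k) = \frac{\mu_l(\{k_l, \ldots, l_l\})}{\mu_k(k_k)} = \frac{Z_k}{Z_l} \cdot \frac{\theta^l + \sum_{i=k}^{l-1} \theta^{i+1}}{\theta^k} = \frac{Z_k}{Z_l} \left( \theta^{l-k} + \sum_{i=0}^{l-k-1} \theta^{i+1} \right) = \frac{Z_k Z_{l-k}}{Z_l}. \]

Observe from (19) that $Z_k < Z_l$ and thus for $j_k \ne k_k$, $q_{k,l}(1)(j_k) < 1$. Since both $\mu_k$ and $\mu_l$ are probability measures and since $\mu_k(q_{k,l}(1)) = \mu_l(1) = 1$, this implies

\[ \max_{j_k \in I_k} q_{k,l}(1)(j_k) = q_{k,l}(1)(k_k) > 1. \]

□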



Thus in order to control $q_{k,l}(1)$ we need bounds on the constants $Z_k$. The following lemma gives two pairs of bounds on $Z_k$. The bounds in (21) get sharp as $\theta$ approaches 1 while the bounds in (22) get sharp as $\theta$ gets large.

Lemma 3.7. We have for $\theta \ge 1$

\[ (k+1)\,\theta \le Z_k \le (k+1)\,\theta^k \tag{21} \]

and

\[ 2\theta^k \le Z_k \quad \text{and, if } \theta > 1, \quad Z_k < \rho(\theta)\,\theta^k, \tag{22} \]

where we define

\[ \rho(\theta) = 2 + \frac{1}{\theta - 1}. \tag{23} \]


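Proof of Lemma 3.7. The bounds in (21) and the lower bound in (22) follow immediately from (19) and from the fact that for $k \ge i$ we have $\theta^k \ge \theta^i$. The upper bound in (22) follows from (20) since

\[ Z_k = \theta^k + \frac{\theta}{\theta - 1} \left( \theta^k - 1 \right) < \left( 1 + \frac{\theta}{\theta - 1} \right) \theta^k = \left( 2 + \frac{1}{\theta - 1} \right) \theta^k. \]

□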



We thus arrive at the following upper bound on $\|q_{k,l}(1)\|_k$.
Corollary 3.8. For $k < l$ and $\theta > 1$ we have

||(Zfc,z(l)|U<mm(^ — , — — ^ 

Proof of Corollary 3.8. By combining each time one lower bound and one upper bound from Lemma 3.7 we obtain four upper bounds on

\[ \|q_{k,l}(1)\|_k = \frac{Z_k Z_{l-k}}{Z_l}. \]

Applying the inequalities (A; + 1)(/ - A; + 1) < \il + 2f and (A; + 1)(/ - A; + l)/(/ + 1) < 1(1 + 2) 
completes the proof. □ 
For $\theta$ sufficiently close to 1, the upper bound



(24) 



which is obtained from using both directions of (21), is the sharpest one. For sufficiently large $\theta$, the bound



< 



(25) 



obtained from (22) is best. Depending on the values of $k$ and $l$, one of the two other bounds may be even better for intermediate values of $\theta$. Finally, note that the third and fourth bounds also apply to $\theta = 1$ since they do not rely on the upper bound from (22).



It is quite intuitive that for $\theta \approx 1$ our bounds on $q_{j,k}(1)$ depend more sensitively on $k$. With a large value of $\theta$, mass is concentrated quickly in the highest branch of the tree such that the sequence $\mu_k(k_k)$ varies relatively little in $k$. For $\theta \approx 1$, mass is accumulated only slowly in $k_k$ as $k$ increases, such that the same sequence changes substantially in $k$, at least for small values of $k$. This is reflected in the fact that our upper bound on $q_{j,k}(1)$ is increasing with $k$ in that case. Put differently, for $\theta$ close to 1 and $k$ not large, the distributions $\mu_k$ are not very concentrated (i.e. close to the uniform distribution) and thus more costly to approximate. As we will see, the approximation error of our algorithm is indeed of worse order in $n$ at $\theta = 1$ than for $\theta > 1$ (or $\theta < 1$). This can also be seen as an elementary manifestation of the critical slowing down phenomenon.



3.4.3. Error Bounds for Sequential MCMC. In the following we give two error bounds, both based on Theorem 2.3: one which degenerates as $\theta$ approaches 1, and one which does not degenerate but which is worse for $\theta$ sufficiently greater than 1. Before we begin, note that a dependence on the parameter $n$ enters the error bound from two sources. While the two terms of the error bound of Theorem 2.3 are, respectively, linear and quadratic in $n$, we obtain a stronger dependence on $n$ in Proposition 3.11 below since $n$ is also the size of the state space $I_n$ and a parameter of the distribution $\mu_n$. To confirm that this difference between the results is not an artefact of our upper bounds, we calculate the asymptotic variance in the case $\theta = 1$ explicitly in Lemma 3.12 at the end of this section.



The first result, for $\theta$ sufficiently greater than 1, is based on the bound (25), i.e., we choose

\\qk,iml<dk,l = ^, 



with $\rho(\theta)$ as defined in (23).



Proposition 3.9. Consider $\theta > 1$, $N \ge \rho(\theta)^4 (n+1)$ and $f \in B(I_n)$. Then we have



2 (p{9fn + l , _6 (n+1) 



N 



Ar2 



Proof of Proposition 3.9. In order to apply Theorem 2.3 we have to control the constants 
introduced therein. By our choice of dj^k, we get 



dk < 



pmk + l) 
Since this upper bound is increasing in k it also applies to dk- Furthermore by (17) we have



EVar^^(g,,„(/))<^(n + l) 



j=0 



This implies 



and since this upper bound is increasing in k it also applies to Vk- Inserting these results into 
Theorem |2.3| completes the proof. □ 



These bounds degenerate quickly as $\theta$ approaches 1 since $\rho(\theta)$ gets arbitrarily large then. To demonstrate that we obtain reasonable constants in our bounds for sufficiently large $\theta$, we give the following result derived from the special case $\theta = 2$ and thus $\rho(2) = 3$.

Corollary 3.10. Consider $\theta \ge 2$, $N \ge 81(n+1)$ and $f \in B(I_n)$. Then we have



nw:!if)-Pnim< 



9 n + 1 {n + lf 

__ + 729^^ 



We now turn to a bound which does not degenerate at $\theta = 1$. For the sake of simplicity we rely on the bound

\[ \|q_{k,l}(1)\|_k \le d_{k,l} = \frac{(l+2)^4}{64} \tag{26} \]

from Corollary 3.8 instead of the bound (24), which is sharper and has a better order in $l$ for $\theta \approx 1$ but which degenerates quickly as $\theta$ increases.

Proposition 3.11. Consider 9 > I, N > j^in + 2)^ and f G B{In). Then we have 

(n + 2)8^ 



iE[k^(/)-M/)l']< 



1 {n + 2f 1 
8 N ^ 128 



iV2 



Proof of Proposition 3.11. By our choice of dj^k, we get 



dk < 



+ 2)5 
32 



(n + 2) 



2 ,^ (k + 2) 
„, and Vk < — 



Since this bound is increasing in k it also applies to dk- Furthermore by (17), we have 

n 

5] Var^^.(<7,- „(/))< 
j=0 

Since the latter bound is increasing in k it also applies to Vk- Inserting these results into 
Theorem 2.3 completes the proof. □ 



As noted above, in Proposition 3.11 we used the bound (26) on $\|q_{k,n}(1)\|_k$ instead of relying on (24), which may have led to a better order at least for $\theta$ close to 1. Thus we expect that the error bound of Proposition 3.11 can be improved concerning the order in $n$. In Section 3.4.4 we show, however, that the approximation error of Sequential Importance Sampling is growing exponentially in $n$ in this example. Thus Proposition 3.11 is strong enough to make our point that the resampling step in our Sequential MCMC algorithm overcomes the problem of weight degeneracy.

To close our analysis of the error bound for $\theta$ close to 1, we explicitly calculate the asymptotic variance - and thus the leading coefficient in the error bound of Theorem 2.3 - for the case $\theta = 1$ and $f = 1 \in B(I_n)$. This asymptotic variance is quadratic in $n$, which proves that it is no artifact of our upper bounds that we do not achieve as good an order in $n$ in Proposition 3.11 as in Proposition 3.9.



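Lemma 3.12. For $\theta = 1$ we have

\[ \mathrm{Var}^{\mathrm{as}}_n(1) = \sum_{j=0}^{n} \mathrm{Var}_{\mu_j}\big(q_{j,n}(1)\big) = \frac{n^2 (n-1)}{12 (n+1)}. \]

Proof of Lemma 3.12. By Proposition 3.3 we have

\[ \mathrm{Var}^{\mathrm{as}}_n(1) = \sum_{j=0}^{n} w_j, \quad \text{where } w_j = \sum_{x \in I_j} \frac{\mu_n(s_n(x))^2}{\mu_j(x)} - 1. \]

Now observe that for $\theta = 1$ and for all $x \in I_j$ we have

\[ \mu_j(x) = \frac{1}{j+1} \]

and

\[ \mu_n(s_n(x)) = \begin{cases} \dfrac{n-j+1}{n+1}, & \text{for } x = j_j, \\[2mm] \dfrac{1}{n+1}, & \text{otherwise.} \end{cases} \]

Thus we have

\[ w_j = \sum_{x \in I_j} \mu_n(s_n(x)) \big( (j+1)\, \mu_n(s_n(x)) - 1 \big) = -1 + (j+1) \sum_{x \in I_j} \mu_n(s_n(x))^2. \]

It is then straightforward to calculate that

\[ \mathrm{Var}^{\mathrm{as}}_n(1) = \sum_{j=0}^{n} w_j = \frac{n^2 (n-1)}{12 (n+1)}, \]

which completes the proof. □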
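The closed form of Lemma 3.12 can be confirmed numerically by summing the variances level by level. The sketch below is only a sanity check under the uniform-distribution facts used in the proof; the function name is hypothetical, and exact rationals are used so that the comparison is not affected by rounding.

```python
from fractions import Fraction


def asymptotic_variance_theta_one(n):
    """Sum over j of Var_{mu_j}(q_{j,n}(1)) for theta = 1 in the tree model."""
    total = Fraction(0)
    for j in range(n + 1):
        mu_j = Fraction(1, j + 1)                  # mu_j is uniform on I_j
        # q_{j,n}(1)(x) = mu_n(s_n(x)) / mu_j(x)
        q_top = Fraction(n - j + 1, n + 1) / mu_j  # x = j_j carries n - j + 1 leaves
        q_other = Fraction(1, n + 1) / mu_j        # all other nodes carry one leaf
        second_moment = mu_j * q_top ** 2 + j * mu_j * q_other ** 2
        total += second_moment - 1                 # the mean of q_{j,n}(1) is 1
    return total
```

For $n = 2$ the function returns 1/9, in agreement with $n^2(n-1)/(12(n+1))$.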



3.4.4. Weight Degeneracy of Sequential Importance Sampling. We now turn to the analysis of Sequential Importance Sampling as introduced in Section 3.3 for the present example. We have

\[ \pi_n(j_n) = \begin{cases} 2^{-(j+1)}, & \text{for } j < n, \\ 2^{-n}, & \text{for } j = n. \end{cases} \]

To prove that, depending on the value of $\theta$, the approximation error of $\eta_n(f)$ may grow exponentially in $n$, consider the approximation error for the test function $f = 1_{\{n_n\}}$.

Corollary 3.13. For $f = 1_{\{n_n\}}$ and $\theta > 0$ we have

\[ E\big[|\eta_n(f) - \mu_n(f)|^2\big] = \begin{cases} \dfrac{2^n - 1}{N} \cdot \dfrac{\theta^{2n}}{Z_n^2}, & \text{for } \theta \ne 1, \\[2mm] \dfrac{2^n - 1}{N (n+1)^2}, & \text{for } \theta = 1. \end{cases} \]

Moreover, $E[|\eta_n(f) - \mu_n(f)|^2]$ grows exponentially in $n$ whenever $\theta > 2^{-1/2}$.

Proof of Corollary 3.13. The explicit formula for the error is a direct consequence of Lemma 3.4: We have $\pi_n(n_n) = 2^{-n}$, and the representation of $\mu_n$ given in Corollary 3.5 yields

\[ \mu_n(n_n) = \frac{\theta^n}{Z_n} \]

for $\theta \ne 1$ and

\[ \mu_n(n_n) = \frac{1}{n+1} \]

for $\theta = 1$. The error grows exponentially in $n$ whenever $2^n \theta^{2n}$ tends to infinity in $n$, which is the case for $\theta > 2^{-1/2}$. □

Thus Sequential Importance Sampling suffers from weight degeneracy when approximating $f = 1_{\{n_n\}}$ even in some cases (i.e. $2^{-1/2} < \theta < 1$) where $\mu_n(n_n)$ is decreasing exponentially itself.
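The threshold $\theta = 2^{-1/2}$ can be made visible by evaluating the error of Lemma 3.4 at the top node for growing $n$. This is a small numerical sketch in floating point; the function names are hypothetical and $Z_n$ is computed from the sum (19).

```python
def z_const(theta, n):
    """Normalizing constant Z_n from (19): theta^n plus the sum of theta^(j+1), j < n."""
    return theta ** n + sum(theta ** (j + 1) for j in range(n))


def sis_error_top_node(theta, n, n_particles):
    """Mean squared error of the SIS estimator for f = 1_{n_n} via Lemma 3.4."""
    mu_top = theta ** n / z_const(theta, n)   # target mass of the top node
    pi_top = 2.0 ** (-n)                      # proposal mass of the top node
    return mu_top ** 2 / n_particles * (1.0 / pi_top - 1.0)
```

With $\theta = 0.8$ (above the threshold) the error increases between $n = 20$ and $n = 40$, while with $\theta = 0.6$ (below it) the error decreases.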

3.4.5. Further Examples. The fast degeneration of Sequential Importance Sampling in the previous example stems from the fact that the particles' movements only depend on the kernels $K_k$ and do not take into account the reweighting through the functions $g_{k,k+1}$. It is easy to construct a (somewhat artificial) example where this turns out to be an advantage and where accordingly Sequential Importance Sampling outperforms Sequential MCMC. This is done in the following. The notation of the previous example is retained unless otherwise noted.

Consider the sequence of state spaces $I_0 = \{0_0\}$ and $I_k = \{0_k, 1_k\}$ for $k = 1, \ldots, 3$. Define a sequence of probability measures $\mu_k$ on $I_k$ through $\mu_0(0_0) = 1$, $\{\mu_1(0_1), \mu_1(1_1)\} = \{\mu_3(0_3), \mu_3(1_3)\} = \{\tfrac{1}{2}, \tfrac{1}{2}\}$ and $\{\mu_2(0_2), \mu_2(1_2)\} = \{a, 1-a\}$ for some $a \in (0,1)$. The tree structure is given by $p_k(0_{k+1}) = 0_k$, $p_0(1_1) = 0_0$ and, for $k > 0$, $p_k(1_{k+1}) = 1_k$. This implies that

\[ K_1(0_0, 0_1) = K_1(0_0, 1_1) = \tfrac{1}{2}, \]

while all other transition kernels are trivial, i.e., for $k \ge 1$ and $j \in \{0, 1\}$, $K_{k+1}(j_k, j_{k+1}) = 1$.

We first consider the approximation error of Sequential Importance Sampling. The Importance Sampling proposal distribution $\pi_3$ coincides with $\mu_3$. Thus from Lemma 3.4 we obtain the following: For $f = 1_{\{0_3\}}$,

\[ E\big[|\eta_3(f) - \mu_3(f)|^2\big] = \frac{\mu_3(0_3)^2}{N} \left( \frac{1}{\pi_3(0_3)} - 1 \right) = \frac{1}{4N}. \tag{27} \]

Observe that this error is independent of $a$. When moving from $\mu_1$ to $\mu_2$, the weights are changed, but this change is removed when moving (back) to $\mu_3$, and throughout the particles' movements are unaffected. So to say, the particles "accidentally" do the right thing when moving from $\mu_0$ to $\mu_1$. To see this, we replace $\mu_1$ by $\mu_1'$ which is essentially the same as $\mu_2$, $\{\mu_1'(0_1), \mu_1'(1_1)\} = \{a, 1-a\}$. Intuitively, this might make the problem easier, because it leads to a "smoother" sequence $\mu_k$. Yet the opposite is the case since the proposal distribution $\pi_3$ is now given by $\pi_3(0_3) = a$ and $\pi_3(1_3) = 1 - a$. Accordingly we get the error

\[ E\big[|\eta_3(f) - \mu_3(f)|^2\big] = \frac{1}{4N} \left( \frac{1}{a} - 1 \right), \]

which gets arbitrarily large for small $a$.
Now we consider the asymptotic variance of Sequential MCMC for the same example, again with the original $\mu_1$ and with the test function $f = 1_{\{0_3\}}$. We thus have to evaluate

\[ \mathrm{Var}^{\mathrm{as}}_3(f) = \sum_{j=0}^{3} \mathrm{Var}_{\mu_j}\big(q_{j,3}(f)\big). \]

Using the formula (14) for $q_{j,k}(f)$ it is straightforward to calculate that

\[ q_{0,3}(1_{\{0_3\}}) = \tfrac{1}{2}, \quad q_{1,3}(1_{\{0_3\}}) = 1_{\{0_1\}}, \quad q_{2,3}(1_{\{0_3\}}) = \frac{1}{2a}\, 1_{\{0_2\}} \quad \text{and} \quad q_{3,3}(1_{\{0_3\}}) = 1_{\{0_3\}}. \]

Accordingly, we have $\mathrm{Var}_{\mu_0}(q_{0,3}(f)) = 0$,

\[ \mathrm{Var}_{\mu_1}(q_{1,3}(f)) = \mathrm{Var}_{\mu_3}(q_{3,3}(f)) = \tfrac{1}{4} \]

and

\[ \mathrm{Var}_{\mu_2}(q_{2,3}(f)) = \frac{1}{4} \left( \frac{1}{a} - 1 \right). \]

Thus the asymptotic variance is given by

\[ \mathrm{Var}^{\mathrm{as}}_3(f) = \frac{1}{4} \left( \frac{1}{a} + 1 \right). \]

Recall that the asymptotic variance also coincides with the coefficient of the leading term in our error bound of Theorem 2.3. Thus we observe that the approximation error gets arbitrarily large for small values of $a$. This is in contrast to the error (27) of Sequential Importance Sampling for the same example, which is independent of $a$.

Changing $\mu_1$ to $\mu_1'$ with $\mu_1'(0_1) = a$ and $\mu_1'(1_1) = 1 - a$ does not lead to a qualitative change of the error bound. We then get

\[ q'_{1,3}(1_{\{0_3\}}) = \frac{1}{2a}\, 1_{\{0_1\}} \quad \text{and} \quad \mathrm{Var}_{\mu_1'}\big(q'_{1,3}(f)\big) = \frac{1}{4} \left( \frac{1}{a} - 1 \right), \]

which leads to an asymptotic variance of

\[ \frac{1}{4} \left( \frac{2}{a} - 1 \right). \]

Observe that again - despite the fact that the sequence $\mu_0, \mu_1, \mu_2, \mu_3$ varies more strongly than $\mu_0, \mu_1', \mu_2, \mu_3$ - the asymptotic variance for small values of $a$ is larger under the second sequence than under the first sequence. The reason for this lies in the fact that $\mu_1$ is a better approximation of the projection of $\mu_3$ to $I_1$ than $\mu_1'$.

For $a > \tfrac{1}{2}$, the asymptotic variance under $\mu_1'$ is smaller than the one under $\mu_1$ and both are well-behaved. But in this case the asymptotic variance for $f' = 1_{\{1_3\}}$ increases more quickly under $\mu_1'$ than under $\mu_1$ as $a$ approaches 1. In this sense, the asymptotic variance is more stable under $\mu_1$ than under $\mu_1'$.
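The comparison in this example can be reproduced with a few lines of exact arithmetic. The sketch below is illustrative only: the function names and the `modified` flag (which switches from $\mu_1$ to $\mu_1'$) are hypothetical, and the code relies on the observation that for $f = 1_{\{0_3\}}$ the function $q_{j,3}(f)$ equals $\mu_3(0_3)/\mu_j(0_j)$ on $0_j$ and vanishes elsewhere.

```python
from fractions import Fraction


def smc_asymptotic_variance(a, modified=False):
    """Sum over j = 0..3 of Var_{mu_j}(q_{j,3}(f)) for f = 1_{0_3}."""
    a = Fraction(a)
    half = Fraction(1, 2)
    mu1_zero = a if modified else half          # mass of 0_1 under mu_1' or mu_1
    mass_of_zero = [Fraction(1), mu1_zero, a, half]   # mu_j(0_j) for j = 0..3
    total = Fraction(0)
    for m in mass_of_zero:
        q = half / m                            # value of q_{j,3}(f) on 0_j
        total += m * q ** 2 - half ** 2         # second moment minus mu_3(f)^2
    return total


def sis_error(a, n_particles, modified=False):
    """Lemma 3.4 for f = 1_{0_3}; the proposal mass of 0_3 is 1/2 or a."""
    pi3_zero = Fraction(a) if modified else Fraction(1, 2)
    return Fraction(1, 4) / n_particles * (1 / pi3_zero - 1)
```

For $a = 1/10$ the Sequential MCMC variances are 11/4 and 19/4 for the original and the modified sequence, while the Sequential Importance Sampling errors with $N = 1$ are 1/4 and 9/4.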

We thus close our comparison of Sequential Importance Sampling and Sequential MCMC on trees with the following conclusion: Sequential Importance Sampling works well if the proposal distribution $\pi_n$ constructed from $\mu_0$ and the transition kernels $K_k$ is sufficiently close to the target distribution $\mu_n$. Sequential MCMC works well if the distributions $\mu_k$ are sufficiently close to the projections of $\mu_n$ to the levels $I_k$. While there is no obvious relationship between these two properties, it seems clear that Sequential MCMC is more suited to applications where the relative densities $g_{k,k+1}$ play a significant role. Furthermore, for both algorithms it is easy to construct examples where they perform arbitrarily badly. Finally, note that for the last example we only considered the asymptotic variance of Sequential MCMC. In order to obtain good constants in our error bounds, we also need that $\mu_j$ is sufficiently close to the projection of $\mu_k$ to $I_j$ for $j < k \le n$.

4. Lp-BOUNDS UNDER LOCAL MIXING 

In this section we consider a more standard Sequential MCMC framework where the distributions $\mu_k$ all live on the same state space and where the transition kernels $K_k$ represent (many steps of an) MCMC dynamics with target distributions $\mu_k$. We assume that the MCMC dynamics mix well only within the elements of increasingly finer partitions of the state space. These partitions take the role of the tree structure of Section 3. Section 4.1 introduces the setting and connects it to the error bound of Theorem 2.2. Section 4.2 derives stability of the Feynman-Kac propagator $q_{j,k}$. Unlike in Section 3, we explicitly take into account the behavior of the dynamics within modes in this section. For this reason, we need two types of additional assumptions: a uniform upper bound on relative densities and sufficiently good mixing within modes.

4.1. The Model. We return to the setting of Section 2 and make a number of additional assumptions: We assume that all the distributions $\mu_k$ live on the same state space $E_k = E$. We assume that the kernel $K_k$ is stationary with respect to $\mu_k$. Accordingly, we assume that the functions $g_{k,k+1} \in B(E)$ are unnormalized relative densities between $\mu_k$ and $\mu_{k+1}$.

In place of the tree structure of Section 3 we now introduce a sequence of partitions of $E$. Let $I_0, \ldots, I_n$ be a collection of finite index sets. Define $I = I_0 \cup \ldots \cup I_n$ and for $0 \le k \le n$

\[ I_{>k} = I_{k+1} \cup \ldots \cup I_n \quad \text{and} \quad I_{<k} = I_0 \cup \ldots \cup I_{k-1}. \]

For all $j \in I$ there is a set $F_j \in \mathcal{B}(E)$ with $\mu_0(F_j) > 0$. Moreover, we assume that for all $0 \le k \le n$ the collection $(F_j)_{j \in I_k}$ is a disjoint partition of $E$. We assume that partitions successively get finer: for $1 \le k \le n$, assume that for all $j \in I_k$ there exists an $i \in I_{k-1}$ with $F_j \subseteq F_i$. Thus for $0 \le k \le n-1$, a well-defined predecessor function $p_k : I_{>k} \to I_k$ is characterized as follows: For $0 \le k < l \le n$, $j \in I_k$ and $i \in I_l$ define

\[ p_k(i) = j \quad \text{if } F_i \subseteq F_j. \]

Conversely, define a successor function $s_k : I_{<k} \to \mathcal{P}(I_k)$ via

\[ s_k(i) = \{ j \in I_k \mid p_l(j) = i \} \quad \text{for } i \in I_l \text{ with } 0 \le l < k. \]

Thus, for $l < k$ and $i \in I_l$, the collection $(F_j)_{j \in s_k(i)}$ is a disjoint partition of $F_i$. We add the simplifying assumption that particles move between partition elements only through the resampling step.

Assumption A. For $1 \le k \le n$ and $j \in I_k$ let $K_k(1_{F_j})(x) = 0$ for all $x \in E \setminus F_j$.

This assumption ensures that if $f$ has support only in $F_j$, $j \in I_k$, then $K_k(f)$ has support only in $F_j$ as well. While this technical assumption will not be literally fulfilled in most applications of interest, it can be seen as an approximation of the fact that particles will move between different modes only extremely rarely through the MCMC dynamics.
In order to apply the error bound of Theorem 2.2 we need to introduce a sequence of norms on $B(E)$. Unlike in the analysis under global mixing in [26] we rely only on local mixing properties. Thus we replace the $L_p$-norms of [26] by stronger norms, which are composed of local $L_p$-norms. To introduce these norms we need a few additional definitions. For $0 \le k \le n$ and $j \in I$, denote by $\mu_{k,j} \in M_1(E)$ the restriction of $\mu_k$ to $F_j$: For $f \in B(E)$,

\[ \mu_{k,j}(f) = \frac{\mu_k(f\, 1_{F_j})}{\mu_k(F_j)}. \]

It proves to be convenient to view $\mu_{k,j}$ as a probability distribution on $E$ (and not on $F_j$). Note that we define $\mu_{k,j}$ for all $j \in I$ (and not only for $j \in I_k$). Furthermore, by Assumption A, $K_k$ is stationary with respect to $\mu_{k,j}$ for all $j \in I_k$. Now for $0 \le k \le n$, $j \in I$ and $p \ge 1$, denote by $\|\cdot\|_{k,j,p}$ the $L_p$-norm with respect to $\mu_{k,j}$: For $f \in B(E)$,

\[ \|f\|_{k,j,p} = \mu_{k,j}(|f|^p)^{1/p}. \]

Next define the norm $\|\cdot\|_{k,p}$ to be the maximum over the $L_p$-norms with respect to $\mu_{k,j}$ with $j \in I_k$: For $f \in B(E)$ and $0 \le k \le n$,

\[ \|f\|_{k,p} = \max_{j \in I_k} \|f\|_{k,j,p}. \]

With this choice of norm we have

\[ \|f\|_{L_p(\mu_k)} \le \|f\|_{k,p}. \]
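On a finite state space the local norms and their maximum are easy to compute, which makes the comparison with the global $L_p$-norm concrete. The following sketch uses hypothetical function names and represents a measure and a test function as lists indexed by the states, and a partition as a list of index lists.

```python
def local_lp_norm(mu, f, block, p):
    """L_p-norm of f with respect to mu restricted (and renormalized) to one block."""
    mass = sum(mu[x] for x in block)
    return (sum(mu[x] * abs(f[x]) ** p for x in block) / mass) ** (1.0 / p)


def max_local_norm(mu, f, partition, p):
    """Maximum of the local L_p-norms over all blocks of the partition."""
    return max(local_lp_norm(mu, f, block, p) for block in partition)


def global_lp_norm(mu, f, p):
    """Ordinary L_p(mu)-norm of f."""
    return sum(m * abs(v) ** p for m, v in zip(mu, f)) ** (1.0 / p)
```

Since the $p$-th power of the global norm is a convex combination of the $p$-th powers of the local norms, `global_lp_norm` never exceeds `max_local_norm`, whatever partition is chosen.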



Now define $c_{j,k}(p,q)$ to be the constant in an $L_p$-$L_q$-bound for $q_{j,k}$: For $p \ge q \ge 1$ and $0 \le j < k \le n$ we have

\[ \|q_{j,k}(f)\|_{j,p} \le c_{j,k}(p,q)\, \|f\|_{k,q} \quad \text{for all } f \in B(E). \]

The following proposition shows how the quantities in the error bound of Theorem 2.2 can be controlled in terms of the constants $c_{j,k}(p,q)$. Accordingly, Section 4.2 is devoted to studying the constants $c_{j,k}(p,q)$.

Proposition 4.1. Fix p > 2 and define 

Cj,kiP) = (cj.fc (P) I) > Cj,k{'^P,P)' 

This choice of cj^k satisfies i.e., for p >2,0<j<k<n and f G B{E) we have 
maK{\\llJq^,k{f)%,pAhMf)\\l^^ WfWlp- 

Moreover 

Var^^{q,M)<CjM'^,m\\l2 

and 

\\qk,k+i{l) - l|U,p < sup \gk,k+iix) - 1|. 



Upper bounds on the quantities % and Vk from Theorem 2.2 follow immediately from the 
bound on Var^^- (g^- fe(/)) and from ||/||fc,2 < ||/||fc,p- 

Proof of Proposition We have ||l||j,p = 1, 



\\qjM)\\Ip < hj,k{f)%,P = hjAf)\\l2p<cjM'^p,p?\\f\\lp 

and 

ii„ /f2^n f^P\ \\f2n ~ f^P\\\f\ 

I k,p' 



\Qj,k{f)\\j,P < Cj,k (P, I) ||/^||fc,E = Cj-fc (p, " ^"^ 
This shows that the condition is indeed satisfied with constants $c_{j,k}(p)$. The upper bound on $\mathrm{Var}_{\mu_j}(q_{j,k}(f))$ follows from

\[ \mathrm{Var}_{\mu_j}(q_{j,k}(f)) \le \|q_{j,k}(f)\|^2_{L_2(\mu_j)} \le \|q_{j,k}(f)\|^2_{j,2} \le c_{j,k}(2,2)^2\, \|f\|^2_{k,2}. \]

The upper bound on $\|q_{k,k+1}(1) - 1\|_{k,p}$ follows immediately from $q_{k,k+1}(1) = \bar g_{k,k+1}$ and the definition of $\|\cdot\|_{k,p}$. □

4.2. Stability of Feynman-Kac Propagators under Local Mixing. We begin with a few more definitions. For $j \in I$, let $m_{k,k+1}(j)$ be the relative change in the mass of $F_j$ between $\mu_k$ and $\mu_{k+1}$,

\[ m_{k,k+1}(j) = \frac{\mu_{k+1}(F_j)}{\mu_k(F_j)}. \]

Furthermore, for $0 \le k \le n-1$, denote by $\bar g_{k,k+1}$ the normalized relative density between $\mu_k$ and $\mu_{k+1}$,

\[ \bar g_{k,k+1} = \frac{g_{k,k+1}}{\mu_k(g_{k,k+1})}. \]

Next we define restricted relative densities: For $0 \le k \le n-1$, $j \in I$ and $x \in E$,

\[ \bar g_{k,k+1,j}(x) = \frac{1}{m_{k,k+1}(j)}\, \bar g_{k,k+1}(x)\, 1_{F_j}(x). \]

Observe that with this choice of $\bar g_{k,k+1,j}$ we have for $f \in B(E)$, $0 \le k \le n-1$ and $j \in I$ that

\[ \mu_{k+1,j}(f) = \frac{\mu_{k+1}(f\, 1_{F_j})}{\mu_{k+1}(F_j)} = \frac{1}{m_{k,k+1}(j)} \cdot \frac{\mu_k(f\, \bar g_{k,k+1}\, 1_{F_j})}{\mu_k(F_j)} = \mu_{k,j}(\bar g_{k,k+1,j}\, f), \]

i.e., $\bar g_{k,k+1,j}$ is a relative density between $\mu_{k,j}$ and $\mu_{k+1,j}$.

We assume a uniform upper bound on restricted relative densities.

Assumption B. There exists $\gamma \ge 1$ such that for every $0 \le k \le n-1$, every $j \in I_k$ and every $x \in F_j$

\[ \bar g_{k,k+1,j}(x) = \frac{1}{m_{k,k+1}(j)}\, \bar g_{k,k+1}(x) \le \gamma. \]

Note that in the extreme case, where $g_{k,k+1}$ is constant on each component $F_j$ with $j \in I_k$, we can choose $\gamma = 1$. This extreme case corresponds roughly to what we assumed in Section 3.

It proves to be convenient not to work with $q_{j,k}$ directly but to work with $\hat q_{j,k}$ defined as follows: For $1 \le k \le n-1$ define $\hat q_{k,k+1} : B(E) \to B(E)$ by

\[ \hat q_{k,k+1}(f) = K_k(\bar g_{k,k+1} f). \]

Furthermore, for $1 \le j < k \le n$ the mapping $\hat q_{j,k} : B(E) \to B(E)$ is given by

\[ \hat q_{j,k}(f) = \hat q_{j,j+1}(\hat q_{j+1,j+2}(\cdots \hat q_{k-1,k}(f))) \quad \text{and} \quad \hat q_{k,k}(f) = f. \]

$q_{j,k}$ and $\hat q_{j+1,k}$ are related through

\[ q_{j,k}(f) = \bar g_{j,j+1}\, \hat q_{j+1,k}(K_k(f)). \]

In Lemma 4.10 below, we show how $L_p$-$L_q$-bounds for $\hat q_{j,k}$ can be used to obtain $L_p$-$L_q$-bounds for $q_{j,k}$.
We proceed by considering first $L_2$-bounds for one time-step and then iterated $L_2$-bounds. From these we conclude one-step $L_p$-bounds and then, in Proposition 4.7, iterated $L_p$-bounds for $\hat q_{j,k}$. Afterwards, we show how to extend this result to $L_p$-$L_q$-bounds, using local hyperboundedness, and to the original family of operators $q_{j,k}$. Proposition 4.11 concludes the bound for $q_{j,k}$ needed in order to make the constants in the error bound of Theorem 2.2 explicit.

We consider mostly inequalities which bound $\|\hat q_{j,k}(f)\|_{j,i,p}$ against $\max_{l \in s_k(i)} \|f\|_{k,l,p}$, for $i \in I_j$. The inequalities which bound $\|\hat q_{j,k}(f)\|_{j,p}$ against $\|f\|_{k,p}$ can then be concluded by taking the maximum over $i \in I_j$. So to say, the latter inequalities are the final results while the former are more useful tools in proving further results.

In order to keep track of how mass is shifted between different components, two more definitions are needed. For $0 \le j < k \le n$ and $i \in I_j$ define by $M_{j,k}(i)$ the following iterated version of $m_{j,j+1}(i)$:

\[ M_{j,k}(i) = \max_{l \in s_k(i)} \prod_{r=j}^{k-1} m_{r,r+1}(p_r(l)). \]

This is the maximal product of relative mass changes one has to go through when moving from $F_i$, $i \in I_j$, to one of its successors $F_l$, $l \in s_k(i) \subseteq I_k$. For the transition from $r$ to $r+1$ the relative mass change of the predecessor of $F_l$ at level $r$ is taken into account. Observe that for $i \in I_j$ we have the relation

\[ M_{j,k}(i) = m_{j,j+1}(i) \max_{l \in s_{j+1}(i)} M_{j+1,k}(l). \tag{28} \]

Furthermore, we define for $0 \le j < k \le n$ the constant $A_{j,k}$ by

\[ A_{j,k} = \max_{i \in I_j} M_{j,k}(i). \]
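The constants $M_{j,k}(i)$ and $A_{j,k}$ can be computed on a finite toy model by following relation (28) recursively. The sketch below is an illustration under simplifying assumptions: the state space is a small set of atoms, each measure is a list of atom masses, each partition is a list of sets of atoms, and all names (including `toy_example`) are hypothetical.

```python
from fractions import Fraction


def mass_ratio(mu_next, mu_cur, cell):
    """m_{r,r+1} for one cell F: mu_{r+1}(F) / mu_r(F)."""
    return sum(mu_next[x] for x in cell) / sum(mu_cur[x] for x in cell)


def iterated_mass_change(mus, partitions, j, i, k):
    """M_{j,k}(i) for cell number i of partitions[j], computed via relation (28)."""
    cell = partitions[j][i]
    m = mass_ratio(mus[j + 1], mus[j], cell)
    if k == j + 1:
        return m
    # successors at level j + 1: the finer cells contained in the current cell
    children = [c for c, sub in enumerate(partitions[j + 1]) if sub <= cell]
    return m * max(iterated_mass_change(mus, partitions, j + 1, c, k) for c in children)


def max_mass_change(mus, partitions, j, k):
    """A_{j,k}: maximum of M_{j,k}(i) over all cells i at level j."""
    return max(iterated_mass_change(mus, partitions, j, i, k)
               for i in range(len(partitions[j])))


def toy_example():
    """Three measures on four atoms with successively finer partitions."""
    mus = [
        [Fraction(1, 4)] * 4,
        [Fraction(1, 8), Fraction(1, 8), Fraction(3, 8), Fraction(3, 8)],
        [Fraction(1, 2), Fraction(1, 8), Fraction(1, 8), Fraction(1, 4)],
    ]
    partitions = [[{0, 1, 2, 3}], [{0, 1}, {2, 3}], [{0}, {1}, {2, 3}]]
    return mus, partitions
```

In the toy example the cell $\{0,1\}$ gains mass by the factor 5/2 between the second and third measure while $\{2,3\}$ loses half of its mass, so $M_{0,2}$ of the root cell and $A_{1,2}$ both equal 5/2.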

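Before we come to local mixing properties and $L_p$-bounds, we briefly look at the $L_1$-case.

Lemma 4.2. For $0 \le j < k \le n$, $f \in B(E)$ and $i \in I_j$ we have

\[ \|\hat q_{j,k}(f)\|_{j,i,1} \le M_{j,k}(i) \max_{l \in s_k(i)} \|f\|_{k,l,1}. \tag{29} \]

Moreover,

\[ \|\hat q_{j,k}(f)\|_{j,1} \le A_{j,k}\, \|f\|_{k,1}. \tag{30} \]

Proof. We can write

\[
\begin{aligned}
\|\hat q_{j,k}(f)\|_{j,i,1} &= \mu_{j,i}\big( |K_j(\bar g_{j,j+1}\, \hat q_{j+1,k}(f))| \big) \\
&\le m_{j,j+1}(i)\, \mu_{j,i}\big( \bar g_{j,j+1,i}\, |\hat q_{j+1,k}(f)| \big) \\
&\le m_{j,j+1}(i) \max_{l \in s_{j+1}(i)} \mu_{j+1,l}\big( |\hat q_{j+1,k}(f)| \big) \\
&\le m_{j,j+1}(i) \max_{l \in s_{j+1}(i)} \|\hat q_{j+1,k}(f)\|_{j+1,l,1}.
\end{aligned}
\tag{31}
\]

Iterating this bound yields

\[ \|\hat q_{j,k}(f)\|_{j,i,1} \le m_{j,j+1}(i) \max_{l_{j+1} \in s_{j+1}(i)} m_{j+1,j+2}(l_{j+1}) \cdots \max_{l_{k-1} \in s_{k-1}(l_{k-2})} m_{k-1,k}(l_{k-1}) \max_{l_k \in s_k(l_{k-1})} \|f\|_{k,l_k,1}. \]

Note that by iterating (28) we obtain

\[ M_{j,k}(i) = m_{j,j+1}(i) \max_{l_{j+1} \in s_{j+1}(i)} m_{j+1,j+2}(l_{j+1}) \cdots \max_{l_{k-1} \in s_{k-1}(l_{k-2})} m_{k-1,k}(l_{k-1}). \]

Thus applying

\[ \max_{l_k \in s_k(l_{k-1})} \|f\|_{k,l_k,1} \le \max_{l \in s_k(i)} \|f\|_{k,l,1} \]

in (31) yields (29). Taking the maximum over $i \in I_j$ gives (30). □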



The proof illustrates how the constants $M_{j,k}(i)$ and $A_{j,k}$ come into play in our bounds. The same arguments appear - in less detail and alongside further complications - in our proofs for $p > 1$. In fact, this didactic purpose is the main motivation behind Lemma 4.2: Using Jensen's inequality one can easily show that for all $i \in I_j$

\[ \|\hat q_{j,k}(f)\|_{j,i,1} \le \frac{\mu_k(F_i)}{\mu_j(F_i)}\, \|f\|_{k,i,1}. \]

This implies

\[ \|\hat q_{j,k}(f)\|_{j,1} \le \left( \max_{i \in I_j} \frac{\mu_k(F_i)}{\mu_j(F_i)} \right) \|f\|_{k,1}, \tag{32} \]

which is generally an improvement over Lemma 4.2.

We now state the local mixing conditions behind our $L_p$-bounds for the case $p \ge 2$.

Assumption C. We have uniform constants $\alpha > 0$ and $\beta \in [0,1]$ such that for all $1 \le k < n$, for all $f \in B(E)$ and for all $i \in I_k$

\[ \|\hat q_{k,k+1}(f)\|^2_{k,i,2} \le m_{k,k+1}(i)^2 \left( \alpha\, \|f\|^2_{k+1,i,2} + \beta\, \mu_{k+1,i}(f)^2 \right). \tag{33} \]

One way to ensure that (33) holds is to assume that the kernels $K_k$ possess the following contraction property: There exists $\rho \in (0,1)$ such that for all $1 \le k < n$, for all $f \in B(E)$ and for all $i \in I_k$

\[ \mu_{k,i}\big( (K_k f - \mu_{k,i}(f))^2 \big) \le (1 - \rho)\, \mathrm{Var}_{\mu_{k,i}}(f). \tag{34} \]

Then it can be shown that (33) holds with $\alpha = (1-\rho)\gamma$ and $\beta = \rho$. Moreover, (34) holding with a sufficiently large $\rho$ is equivalent to a local Poincaré inequality with a sufficiently large spectral gap being satisfied, see Section 5.1 of [26] for easily adaptable arguments under global mixing assumptions. Intuitively, a smaller value of $\alpha$ corresponds to better mixing and thus more MCMC steps. Assumption C immediately implies the following one-step $L_2$-bound for $\hat q_{j,k}$.

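Corollary 4.3. For $1 \le k < n$, for all $f \in B(E)$ and for all $i \in I_k$ we have

\[ \|\hat q_{k,k+1}(f)\|^2_{k,i,2} \le m_{k,k+1}(i)^2 \left( \alpha \max_{l \in s_{k+1}(i)} \|f\|^2_{k+1,l,2} + \beta \max_{l \in s_{k+1}(i)} \mu_{k+1,l}(f)^2 \right) \tag{35} \]

and

\[ \|\hat q_{k,k+1}(f)\|^2_{k,2} \le A^2_{k,k+1} \left( \alpha\, \|f\|^2_{k+1,2} + \beta \max_{l \in I_{k+1}} \mu_{k+1,l}(f)^2 \right) \le A^2_{k,k+1}\, (\alpha + \beta)\, \|f\|^2_{k+1,2}. \]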

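Next we iterate (35) to obtain an $L_2$-bound for more than one step.

Lemma 4.4. Assume $\alpha < 1$. Then for $1 \le j < k \le n$, $f \in B(E)$ and $i \in I_j$ we have the bounds

\[ \|\hat q_{j,k}(f)\|^2_{j,i,2} \le M_{j,k}(i)^2 \left( \alpha^{k-j} \max_{l \in s_k(i)} \|f\|^2_{k,l,2} + \frac{\beta}{1-\alpha} \max_{l \in s_k(i)} \mu_{k,l}(f)^2 \right), \tag{36} \]

and

\[ \|\hat q_{j,k}(f)\|_{j,i,2} \le M_{j,k}(i)\, \frac{1}{(1-\alpha)^{1/2}} \max_{l \in s_k(i)} \|f\|_{k,l,2} \tag{37} \]

and

\[ \|\hat q_{j,k}(f)\|_{j,2} \le A_{j,k}\, \frac{1}{(1-\alpha)^{1/2}}\, \|f\|_{k,2}. \tag{38} \]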


Proof. Applying (35) yields 

|2 / _ /■\2 



Qj,k{f)\\j,i,2 < 'mjj+lii) a , max \\qj+i,k{f)\\j+i,l.2 + , max ^ij+u{qj+l,k{f)) ■ (39) 



Arguing as in the proof of Lemma |4.2| yields the inequality 



max Mj+i,K^i+i,A;(/)) < Mj+i^k{i) max uk,iif), 



(40) 



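which can be used to bound the second term on the right hand side of (39). To the first term in (39) we can again apply (35), which yields again two terms, one which can be bounded through (35) and one which can be bounded through (40). Iterating this reasoning and collecting the factors $m_{r,r+1}$ into terms $M_{j,k}$ gives us

\[ \|\hat q_{j,k}(f)\|^2_{j,i,2} \le M_{j,k}(i)^2 \left( \alpha^{k-j} \max_{l \in s_k(i)} \|f\|^2_{k,l,2} + \beta \sum_{r=0}^{k-j-1} \alpha^r \max_{l \in s_k(i)} \mu_{k,l}(f)^2 \right). \tag{41} \]

Applying to this the geometric series inequality yields (36). Since $\mu_{k,l}(f)^2 \le \|f\|^2_{k,l,2}$ for all $l \in s_k(i)$ and since $\beta \le 1$, we can conclude from (41) that

\[ \|\hat q_{j,k}(f)\|^2_{j,i,2} \le M_{j,k}(i)^2 \sum_{r=0}^{k-j} \alpha^r \max_{l \in s_k(i)} \|f\|^2_{k,l,2}, \]

which implies (37) by the geometric series inequality. Taking the maximum over $i \in I_j$ in (37) gives (38). □

Our next step is the following one-step $L_p$-bound.

Lemma 4.5. For $1 \le k < n$, for all $f \in B(E)$, for all $i \in I_k$ and for all $p \ge 1$ we have

\[ \|\hat q_{k,k+1}(f)\|^{2p}_{k,i,2p} \le m_{k,k+1}(i)^{2p}\, \gamma^{2p-2} \left( \alpha \max_{l \in s_{k+1}(i)} \|f\|^{2p}_{k+1,l,2p} + \beta \max_{l \in s_{k+1}(i)} \|f\|^{2p}_{k+1,l,p} \right) \tag{42} \]

and

\[ \|\hat q_{k,k+1}(f)\|^{2p}_{k,2p} \le A^{2p}_{k,k+1}\, \gamma^{2p-2} \left( \alpha\, \|f\|^{2p}_{k+1,2p} + \beta\, \|f\|^{2p}_{k+1,p} \right) \le A^{2p}_{k,k+1}\, \gamma^{2p-2}\, (\alpha + \beta)\, \|f\|^{2p}_{k+1,2p}. \tag{43} \]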



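Proof. We can write

\[
\begin{aligned}
\|\hat q_{k,k+1}(f)\|^{2p}_{k,i,2p} &= \mu_{k,i}\big( |K_k(\bar g_{k,k+1} f)|^{2p} \big) \\
&\le \mu_{k,i}\big( K_k(\bar g^{\,p}_{k,k+1} |f|^p)^2 \big) \\
&= \|\hat q_{k,k+1}(\bar g^{\,p-1}_{k,k+1} |f|^p)\|^2_{k,i,2}.
\end{aligned}
\]

Using (33) and Assumption B, we can bound this expression to obtain

\[
\begin{aligned}
\|\hat q_{k,k+1}(f)\|^{2p}_{k,i,2p} &\le m_{k,k+1}(i)^{2p}\, \gamma^{2p-2} \left( \alpha\, \||f|^p\|^2_{k+1,i,2} + \beta\, \||f|^p\|^2_{k+1,i,1} \right) \\
&\le m_{k,k+1}(i)^{2p}\, \gamma^{2p-2} \left( \alpha\, \|f\|^{2p}_{k+1,i,2p} + \beta\, \|f\|^{2p}_{k+1,i,p} \right),
\end{aligned}
\]

which immediately implies (42). (43) follows by taking the maximum over $i \in I_k$. □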



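Next we iterate the bound of Lemma 4.5 to show how an $L_p$-bound for $\hat q_{j,k}$ implies an $L_{2p}$-bound:

Lemma 4.6. Assume that $\alpha \gamma^{2p-2} < 1$ and that for some $\delta(p) \ge 1$, for all $1 \le j < k \le n$, $i \in I_j$ and $f \in B(E)$ the inequality

\[ \|\hat q_{j,k}(f)\|_{j,i,p} \le M_{j,k}(i)\, \delta(p) \max_{l \in s_k(i)} \|f\|_{k,l,p} \tag{44} \]

is fulfilled. Then we have

\[ \|\hat q_{j,k}(f)\|_{j,i,2p} \le M_{j,k}(i)\, \delta(2p) \max_{l \in s_k(i)} \|f\|_{k,l,2p} \tag{45} \]

with

\[ \delta(2p) = \delta(p)\, \frac{\gamma^{\frac{p-1}{p}}}{(1 - \alpha \gamma^{2p-2})^{\frac{1}{2p}}}. \]

Moreover, we have

\[ \|\hat q_{j,k}(f)\|_{j,2p} \le A_{j,k}\, \delta(2p)\, \|f\|_{k,2p}. \tag{46} \]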



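Proof. Define $\theta = \alpha \gamma^{2p-2}$. Iterating the inequality of Lemma 4.5 and utilizing that $\beta \le 1$, we get

\[ \|\hat q_{j,k}(f)\|^{2p}_{j,i,2p} \le M_{j,k}(i)^{2p}\, \theta^{k-j} \max_{l \in s_k(i)} \|f\|^{2p}_{k,l,2p} + \gamma^{2p-2} \sum_{r=j+1}^{k} \theta^{r-1-j} \max_{l \in s_r(i)} R_{j,r}(l)^{2p}\, \|\hat q_{r,k}(f)\|^{2p}_{r,l,p}, \tag{47} \]

where for $l \in I_r$, $R_{j,r}(l)$ is defined by

\[ R_{j,r}(l) = \prod_{t=j}^{r-1} m_{t,t+1}(p_t(l)). \]

Observe that for $i \in I_j$ we have

\[ M_{j,k}(i) = \max_{l \in s_k(i)} R_{j,k}(l) \]

and moreover for $j < r < k$ and $i \in I_j$

\[ M_{j,k}(i) = \max_{l \in s_r(i)} R_{j,r}(l)\, M_{r,k}(l). \tag{48} \]

Thus applying (44) and then (48) to bound the factors $\|\hat q_{r,k}(f)\|_{r,l,p}$, we obtain from (47) the inequality

\[ \|\hat q_{j,k}(f)\|^{2p}_{j,i,2p} \le M_{j,k}(i)^{2p}\, \theta^{k-j} \max_{l \in s_k(i)} \|f\|^{2p}_{k,l,2p} + \gamma^{2p-2}\, M_{j,k}(i)^{2p}\, \delta(p)^{2p} \sum_{r=j+1}^{k} \theta^{r-1-j} \max_{l \in s_k(i)} \|f\|^{2p}_{k,l,p}. \]

Since we assumed $\gamma \ge 1$ and $\delta(p) \ge 1$ and since we have

\[ \max_{l \in s_k(i)} \|f\|^{2p}_{k,l,p} \le \max_{l \in s_k(i)} \|f\|^{2p}_{k,l,2p}, \]

we thus have

\[ \|\hat q_{j,k}(f)\|^{2p}_{j,i,2p} \le M_{j,k}(i)^{2p}\, \gamma^{2p-2}\, \delta(p)^{2p} \sum_{r=j+1}^{k+1} \theta^{r-1-j} \max_{l \in s_k(i)} \|f\|^{2p}_{k,l,2p}. \]

By the geometric series inequality and our assumption of $\theta < 1$, we thus get

\[ \|\hat q_{j,k}(f)\|^{2p}_{j,i,2p} \le M_{j,k}(i)^{2p}\, \delta(2p)^{2p} \max_{l \in s_k(i)} \|f\|^{2p}_{k,l,2p} \]

with

\[ \delta(2p) = \delta(p)\, \frac{\gamma^{\frac{p-1}{p}}}{(1 - \theta)^{\frac{1}{2p}}}. \]

This shows (45). (46) follows by taking the maximum over $i \in I_j$. □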



Combining Lemmas 4.4 and 4.6 we can now state the key result of this section as follows. 

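Proposition 4.7. For $r \in \mathbb{N}$, consider $p = 2^r$ and assume that $\alpha\gamma^{p-2} < 1$. Then we have for $1 \le j < k \le n$, $i \in I_j$ and $f \in B(E)$ the inequality

\[
\|q_{j,k}(f)\|_{j,i,p} \le M_{j,k}(i)\,\delta(p) \max_{l \in s_k(i)} \|f\|_{k,l,p},
\tag{49}
\]

with

\[
\delta(p) = \prod_{j=1}^{r} \frac{\gamma^{1 - 2^{-(j-1)}}}{(1 - \alpha\gamma^{2^j - 2})^{2^{-j}}}
\le \frac{\gamma^{\,r - 2 + 2^{-(r-1)}}}{1 - \alpha\gamma^{2^r - 2}}.
\]

Moreover,

\[
\|q_{j,k}(f)\|_{j,p} \le A_{j,k}\,\delta(p)\,\|f\|_{k,p}.
\tag{50}
\]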
Proof. We proceed by induction over $r$. The cases $r = 0$ and $r = 1$ follow from Lemmas 4.2 and 4.4, respectively. The inequalities for $r > 1$ follow because Lemma 4.6 implies that we can choose

\[
\delta(p) = \delta(2) \prod_{j=2}^{r} \frac{\gamma^{1 - 2^{-(j-1)}}}{(1 - \alpha\gamma^{2^j - 2})^{2^{-j}}}.
\]

We can apply Lemma 4.6 iteratively, since $\alpha\gamma^{p-2} < 1$ implies $\alpha\gamma^{q-2} < 1$ for all $q < p$. For the upper bound on $\delta(p)$, we apply the geometric series equality in the numerator, bound the term in brackets under the exponent in the denominator by $1 - \alpha\gamma^{p-2}$ and apply the geometric series inequality to the product. This shows (49). (50) follows by taking the maximum over $i \in I_j$. □
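To get a feeling for the size of the constant, the following minimal sketch evaluates it numerically. It assumes the reading $\delta(2^r) = \prod_{j=1}^{r} \gamma^{1-2^{-(j-1)}} / (1-\alpha\gamma^{2^j-2})^{2^{-j}}$ with upper bound $\gamma^{r-2+2^{-(r-1)}} / (1-\alpha\gamma^{2^r-2})$; the function names and the sample values of $\alpha$ and $\gamma$ are hypothetical and chosen only for illustration.

```python
import math


def delta_dyadic(r, alpha, gamma):
    """Evaluate delta(2**r) via the product formula (assumed reading of Prop. 4.7)."""
    if not 0 <= alpha < 1 or gamma < 1:
        raise ValueError("expected 0 <= alpha < 1 and gamma >= 1")
    p = 2 ** r
    if alpha * float(gamma) ** (p - 2) >= 1:
        raise ValueError("requires alpha * gamma**(p - 2) < 1")
    value = 1.0  # empty product for r = 0
    for j in range(1, r + 1):
        numerator = gamma ** (1 - 2.0 ** (-(j - 1)))
        denominator = (1 - alpha * gamma ** (2 ** j - 2)) ** (2.0 ** (-j))
        value *= numerator / denominator
    return value


def delta_upper_bound(r, alpha, gamma):
    """Closed-form upper bound on delta(2**r) from the same proposition."""
    exponent = r - 2 + 2.0 ** (-(r - 1))
    return gamma ** exponent / (1 - alpha * float(gamma) ** (2 ** r - 2))


# Illustrative (hypothetical) parameter values.
alpha, gamma = 0.1, 1.2
for r in range(4):
    print(2 ** r, delta_dyadic(r, alpha, gamma), delta_upper_bound(r, alpha, gamma))
```

For fixed $\alpha$, both the constant and its bound grow with $\gamma$, and the admissibility condition $\alpha\gamma^{p-2} < 1$ fails once $p$ or $\gamma$ becomes too large.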

Since the constants $\delta(2^r)$ are monotonically increasing in $r$, we can immediately extend the
bounds of Proposition 4.7 to general $p \ge 1$ using the Riesz-Thorin interpolation theorem (see
Davies fS], §LL5). 



Corollary 4.8. Consider $p \in [2^r, 2^{r+1}]$ for $r \in \mathbb{N}$ and assume $\alpha\gamma^{2^{r+1}-2} < 1$. Then for $1 \le j < k \le n$ and $f \in B(E)$ and $i \in I_j$ we have

\[
\|q_{j,k}(f)\|_{j,i,p} \le M_{j,k}(i)\,\delta(p) \max_{l \in s_k(i)} \|f\|_{k,l,p}
\]

and

\[
\|q_{j,k}(f)\|_{j,p} \le A_{j,k}\,\delta(p)\,\|f\|_{k,p}
\]

with $\delta(p) = \delta(2^{r+1})$ and $\delta(2^{r+1})$ as defined in Proposition 4.7.



We still need two more results: one which shows how to translate $L_p$-stability into $L_p$-$L_q$-stability using local hyperboundedness, and one which relates our bounds for $q_{j+1,k}$ to corresponding bounds for $q_{j,k}$. We first show how $L_p$-$L_q$-inequalities follow from our $L_p$-inequalities and a local hypercontractivity assumption on the kernels $K_j$.

Assumption D. For $1 \le j \le n$, we have a constant $\theta_j(p,q) > 0$ such that for all $i \in I_j$ and all $f \in B(E)$ we have

\[
\|K_j(f)\|_{j,i,p} \le \theta_j(p,q)\,\|f\|_{j,i,q}.
\tag{51}
\]

Adding this assumption for the remainder of the section, we obtain the following: 

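Corollary 4.9. Consider $p \ge 1$ and $q \ge 1$. Let $q \in [2^r, 2^{r+1}]$ for $r \in \mathbb{N}$ and assume $\alpha\gamma^{2^{r+1}-2} < 1$. Then for $j < k \le n$ and $i \in I_j$, we have

\[
\|q_{j,k}(f)\|_{j,i,p} \le M_{j,k}(i)\,\theta_j(p,q)\,\gamma^{\frac{q-1}{q}}\,\delta(q) \max_{l \in s_k(i)} \|f\|_{k,l,q}
\tag{52}
\]

and

\[
\|q_{j,k}(f)\|_{j,p} \le A_{j,k}\,\theta_j(p,q)\,\gamma^{\frac{q-1}{q}}\,\delta(q)\,\|f\|_{k,q}
\tag{53}
\]

with $\delta(q)$ as defined in Corollary 4.8.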



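Proof. By Assumption D we have

\[
\|q_{j,k}(f)\|_{j,i,p} \le \theta_j(p,q)\, m_{j,j+1}(i) \max_{l \in s_{j+1}(i)} \|g_{j,j+1}\, q_{j+1,k}(f)\|_{j,l,q}
\]

and thus by Corollary 4.8

\[
\|q_{j,k}(f)\|_{j,i,p} \le \theta_j(p,q)\, M_{j,k}(i)\,\gamma^{\frac{q-1}{q}}\,\delta(q) \max_{l \in s_k(i)} \|f\|_{k,l,q}.
\]

This shows (52). Taking the maximum over $i \in I_j$ proves (53). □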

Next, we show how to obtain bounds for $q_{j,k}$ from our bounds for $q_{j+1,k}$.

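Lemma 4.10. Assume that for some $p \ge 1$ and $q \ge 1$ and for fixed $1 \le j+1 < k \le n$ we have a $\delta > 0$ such that for all $l \in I_{j+1}$ and for all $f \in B(E)$

\[
\|q_{j+1,k}(f)\|_{j+1,l,p} \le \delta\, M_{j+1,k}(l) \max_{r \in s_k(l)} \|f\|_{k,r,q}.
\tag{54}
\]

Then we have for all $i \in I_j$

\[
\|q_{j,k}(f)\|_{j,i,p} \le \gamma^{\frac{p-1}{p}}\,\delta\, M_{j,k}(i) \max_{r \in s_k(i)} \|f\|_{k,r,q}
\]

and

\[
\|q_{j,k}(f)\|_{j,p} \le \gamma^{\frac{p-1}{p}}\,\delta\, A_{j,k}\,\|f\|_{k,q}.
\]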




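Proof. Note that we have for $i \in I_j$

\[
\begin{aligned}
\|q_{j,k}(f)\|_{j,i,p}
&= \mu_{j,i}\bigl(|g_{j,j+1}\, q_{j+1,k}(K_k(f))|^{p}\bigr)^{\frac{1}{p}} \\
&= \mu_{j,i}\bigl(|g_{j,j+1,i}\, q_{j+1,k}(K_k(f))|^{p}\bigr)^{\frac{1}{p}} \\
&\le \gamma^{\frac{p-1}{p}}\, m_{j,j+1}(i)\, \mu_{j+1,i}\bigl(|q_{j+1,k}(K_k(f))|^{p}\bigr)^{\frac{1}{p}} \\
&\le \gamma^{\frac{p-1}{p}}\, m_{j,j+1}(i) \max_{l \in s_{j+1}(i)} \mu_{j+1,l}\bigl(|q_{j+1,k}(K_k(f))|^{p}\bigr)^{\frac{1}{p}} \\
&= \gamma^{\frac{p-1}{p}}\, m_{j,j+1}(i) \max_{l \in s_{j+1}(i)} \|q_{j+1,k}(K_k(f))\|_{j+1,l,p}
\end{aligned}
\]

and thus by (54)

\[
\begin{aligned}
\|q_{j,k}(f)\|_{j,i,p}
&\le \gamma^{\frac{p-1}{p}}\, m_{j,j+1}(i)\,\delta \max_{l \in s_{j+1}(i)} M_{j+1,k}(l) \max_{r \in s_k(l)} \|K_k(f)\|_{k,r,q} \\
&\le \gamma^{\frac{p-1}{p}}\, M_{j,k}(i)\,\delta \max_{r \in s_k(i)} \|K_k(f)\|_{k,r,q} \\
&\le \gamma^{\frac{p-1}{p}}\, M_{j,k}(i)\,\delta \max_{r \in s_k(i)} \|f\|_{k,r,q},
\end{aligned}
\]

where in the last step we used that by Jensen's inequality $|K_k(f)|^{q} \le K_k(|f|^{q})$ and that $K_k$ is stationary with respect to $\mu_{k,l}$ for all $l \in I_k$. This shows the first inequality. The second inequality follows by taking the maximum over $i \in I_j$ on both sides. □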



Combining Lemma 4.10 and Corollary 4.9 we can immediately conclude the types of inequalities needed to derive error bounds from Theorem 2.2 and Proposition 4.1.



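Proposition 4.11. Consider $p \ge 1$ and $q \ge 1$. Let $q \in [2^r, 2^{r+1}]$ for $r \in \mathbb{N}$ and assume $\alpha\gamma^{2^{r+1}-2} < 1$. Then for $1 \le j < k \le n$ we have

\[
\|q_{j,k}(f)\|_{j,p} \le C_{j,k}(p,q)\,\|f\|_{k,q}
\tag{55}
\]

with

\[
C_{j,k}(p,q) = A_{j,k}\,\theta_j(p,q)\,\gamma^{\frac{p-1}{p}}\,\gamma^{\frac{q-1}{q}}\,\delta(q),
\]

where $\delta(q)$ is as defined in Corollary 4.8.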

The stability inequalities in this section differ from those in [26] by containing the factor $A_{j,k}$ on the right-hand side. For the case treated in that paper, i.e., $|I_j| = 1$ for $0 \le j \le n$ and $F_i = E$ for all $i \in I$, we obtain $A_{j,k} = 1$. Thus the results of the present section contain those obtained there as special cases. For the case of invariant partitions treated in Eberle and Marinelli [14], i.e., $|I_j| = |I_k|$ for $0 \le j < k \le n$, we obtain

\[
A_{j,k} = \max_{i \in I_j} \frac{\mu_k(F_i)}{\mu_j(F_i)}
\tag{56}
\]

which is a discrete-time analogue of their constant.
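As a rough numerical illustration of how such a constant behaves, the sketch below assumes that, for an invariant partition, the constant in (56) is the largest ratio $\mu_k(F_i)/\mu_j(F_i)$ of component masses. The function name and the mass vectors are hypothetical toy inputs, not taken from the paper.

```python
def mass_shift_constant(mass_j, mass_k):
    """Largest ratio mu_k(F_i) / mu_j(F_i) over the components of a fixed partition.

    mass_j, mass_k: component masses under mu_j and mu_k, listed in the same order.
    Assumes the reading of (56) as a maximum of mass ratios.
    """
    if len(mass_j) != len(mass_k):
        raise ValueError("both mass vectors must refer to the same partition")
    if any(m <= 0 for m in mass_j):
        raise ValueError("every component needs positive mass under mu_j")
    return max(mk / mj for mj, mk in zip(mass_j, mass_k))


# Toy case 1: three components start with tiny mass eps and end with mass 1/4 each.
eps = 1e-3
growing_component = mass_shift_constant([1 - 3 * eps, eps, eps, eps], [0.25] * 4)

# Toy case 2: component masses do not change between the two levels.
balanced = mass_shift_constant([0.5, 0.5], [0.5, 0.5])

print(growing_component, balanced)
```

In the first toy case the constant scales like $1/(4\varepsilon)$, while in the second it equals one; this matches the qualitative message that mass flowing into initially negligible components is what inflates the bound.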



Compared to the setting of Section 3, the present setting is more general in two respects: We take into account local mixing and local variations in relative densities. To disentangle these two factors to some extent, consider the case where we take into account local mixing but assume that the relative densities $g_{k,k+1}$ are constant on each of the sets $F_j$ with $j \in I_k$. In that case, we have $\gamma = 1$ and the inequality (55) in Proposition 4.11 becomes

\[
\|q_{j,k}(f)\|_{j,p} \le A_{j,k}\,\theta_j(p,q)\,\delta(q)\,\|f\|_{k,q}.
\]

Thus, compared to the results of Section 3, we obtain different norms, we obtain the constants $A_{j,k}$ which are similar to but greater than the corresponding constants in Section 3, and we obtain an additional factor taking into account hyperboundedness and local mixing.
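For $\gamma = 1$ the mixing-related factor can be made explicit. As a worked sketch, assuming the product formula for $\delta(2^r)$ from Proposition 4.7 and the convention $\delta(q) = \delta(2^{r+1})$ for $q \in [2^r, 2^{r+1}]$ from Corollary 4.8, one obtains:

```latex
\[
\delta(q) = \delta(2^{r+1})
  = \prod_{j=1}^{r+1} (1-\alpha)^{-2^{-j}}
  = (1-\alpha)^{-\sum_{j=1}^{r+1} 2^{-j}}
  = (1-\alpha)^{-(1 - 2^{-(r+1)})}
  \le \frac{1}{1-\alpha}.
\]
```

So in this case the factor is bounded uniformly in $q$ and depends on the local dynamics only through $\alpha$ and the hyperboundedness constant $\theta_j(p,q)$.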



5. Outlook 

Since the analysis of Section 3 abstracts from most of the local structure - and thus from
some possible problems - it should be seen as providing a rough but intuitive criterion for 
identifying settings where the algorithm works or does not work. Consider for instance the 
mean field Potts model for which slow mixing of the Parallel Tempering algorithm was proved 
by Bhatnagar and Randall p], see their paper for more details about the model. Basically, in 
this model there is a distribution $\mu_0$ which is unimodal and a distribution $\mu_n$ which has four
modes of roughly equal weight. Along the transition from $\mu_0$ to $\mu_n$ three additional modes
arise which are immediately well-separated from the initial one and have a tiny initial mass,
say $\varepsilon$. Then we obtain huge constants $d_{j,k}$ proportional to $\varepsilon^{-1}$ in Proposition 3.2 and thus in
the error bound.


For the mean field Ising model, Madras and Zheng [23] proved rapid mixing of Parallel
Tempering. The results of Section 3 suggest that the same should be true for Sequential
MCMC: In the mean field Ising model there is basically one mode which is split into two
modes of equal weight at some point as we move from $\mu_0$ to $\mu_n$. In this case, each mode has
exactly the same weight as its successors and thus $d_{j,k} = 1$. Therefore we can expect a good
performance of the algorithm.

A similar good performance of Sequential MCMC can be expected in the problem of estimating the parameters of mixture distributions as described in Celeux, Hurn and Robert [3]. There, the target distribution $\mu_n$ is a distribution on a parameter space of real matrices with the symmetry property that $\mu_n$ assigns the same weight to all states $\theta'$ which can be obtained by permuting the rows of a given parameter $\theta$. To illustrate the source of this multimodality problem consider the following example: The Gaussian mixture distributions

$0.25\,\mathcal{N}(5, 1) + 0.75\,\mathcal{N}(0, 1)$ and $0.75\,\mathcal{N}(0, 1) + 0.25\,\mathcal{N}(5, 1)$

are identical, i.e., some permutations of the parameters correspond to the same mixture distribution. The target distribution $\mu_n$ of MCMC is a posterior distribution on the parameter space and thus assigns the same weight to $\theta = (0.25, 5, 1;\, 0.75, 0, 1) \in \mathbb{R}^{2 \times 3}$ and $\theta' = (0.75, 0, 1;\, 0.25, 5, 1) \in \mathbb{R}^{2 \times 3}$. The results of Section 3 suggest that if this type of permutation symmetry is the sole source of multimodality, Sequential MCMC should work well, since the symmetry is retained when tempering the target distribution. Thus, the areas around each local mode have the same weight at all "temperatures", $d_{j,k} = 1$. This intuition is confirmed by simulations of Celeux, Hurn and Robert [5] who study an example along these lines and demonstrate that Simulated Tempering can move between local modes while simple MCMC cannot. Permutation symmetries are also one source of multimodality in models from chemical physics, see Wales [27]. Another message of the analysis of Section 3 is however that multimodality caused by permutation symmetries is one of the easiest cases of multimodality to deal with. Thus, examples of this type are rather limited toy examples for testing a multilevel MCMC algorithm's ability to move between disconnected modes.
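The permutation symmetry in the Gaussian mixture example can be checked directly. The following minimal sketch evaluates the mixture density for the parameter $\theta = (0.25, 5, 1;\, 0.75, 0, 1)$ and for its row-permuted version; the helper names are hypothetical, and the third entry of each row is treated as a standard deviation (with value 1 this coincides with the variance).

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, theta):
    """Mixture density; theta is a sequence of rows (weight, mean, std)."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in theta)


theta = [(0.25, 5.0, 1.0), (0.75, 0.0, 1.0)]
theta_permuted = [theta[1], theta[0]]  # swap the two rows

# Compare both parameterizations on a grid from -3 to 8.
grid = [-3.0 + 0.5 * i for i in range(23)]
max_gap = max(abs(mixture_pdf(x, theta) - mixture_pdf(x, theta_permuted)) for x in grid)
print(max_gap)
```

Since both parameter matrices define the same likelihood for any data set, a posterior with an exchangeable prior has one mode per row permutation, all of equal weight, which is the situation described above.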

The results of Section 4 convey basically the same intuition as those of Section 3 but they explicitly take into account local mixing and sufficient similarity of $\mu_k$ and $\mu_{k+1}$ within disconnected components. Both aspects are important to keep in mind: Assume we choose $n = 1$, let $\mu_0$ be a distribution for which we have excellent global mixing and let $\mu_1$ be an arbitrary other distribution which is strongly multimodal. This setting can be projected to a tree where a number of leaves branch from a single root. Then the results of Section 3 seem to suggest that Sequential MCMC with only these two distributions should work very well, since the root of the tree has mass 1 under $\mu_0$ and its successors together have mass 1 under $\mu_1$. This - obviously false - conclusion can be drawn from Section 3 since the model does not take into account local variations in relative densities which may lead to huge errors in the Importance Sampling Resampling step. Basically, in Section 3 we make the - implicit - assumption that relative densities are constant within each component. Similar "wrong intuitions" can be derived from the model of Section 3 by disregarding the fact that local mixing has to be guaranteed within each component.

To sum up, the results of Section 3 are to be seen as convenient tools for developing intuitive
results about the algorithm such as our comparison of Resampling and Weighting in Section
3.4. The results of Section 4 are to be seen as first steps toward explicit error bounds
for Sequential MCMC in actual examples. They are only first steps for two reasons: 1)
Assumption A will not literally be fulfilled in most applications since the probability of
transitions between disconnected modes will be negligible but positive. 2) While the Mixing
Assumptions C and D are weaker than the corresponding assumptions in most of the literature,
see [25], [26] for more discussion, they will be hard to check in most applications of interest.
Concretely, for diffusion processes hyperboundedness and a spectral gap follow, e.g., from a
Logarithmic Sobolev inequality which can be verified using the Bakry-Emery criterion, see,
e.g., pj. For the usual discrete-time MCMC dynamics on non-compact state spaces similar 
tools for proving hyperboundedness are still missing. 

Our results demonstrate that problems of the algorithm which stem from disconnected components gaining mass can generally not be alleviated by increasing the number of interpolating distributions: Adding additional steps in the sequence $\mu_0, \ldots, \mu_n$ can only increase the constants $d_{j,k}$ and $A_{j,k}$. This separates this type of problem from problems associated with large local variations in relative densities, i.e. a large value of $\gamma$ in Assumption B, which can be controlled fairly well through the number of interpolating distributions. The only way to control the constants $d_{j,k}$ and $A_{j,k}$ seems to be to choose an entirely different sequence of distributions.

Finally, the results of Sections 3 and 4 suggest that, generally, a bad performance of Sequential MCMC is not a property of the target distribution $\mu_n$ but a property of the approximating sequence $\mu_0, \ldots, \mu_{n-1}$ which is a parameter in the algorithm, not in the problem of interest. So far, the theoretical literature on multilevel MCMC algorithms, i.e., Simulated Tempering, Parallel Tempering and Sequential MCMC, has largely focused on flattening a target distribution by tempering. In the applied literature, there are many more, sometimes model-specific, proposals for choosing a sequence of distributions such as cutting off the Hamiltonian at chosen minimum levels in addition to tempering, varying the system size or spatial coarse graining, see, e.g., [18], [20], [21]. A more systematic study of methods for approximating target distributions seems to be an important and highly challenging task for future research.